Technical Recommendations
for Psychological Tests and Diagnostic Techniques





Vol. 51, No. 2, Part 2 March, 1954 


Supplement to the 


Psychological Bulletin 


Published bimonthly by the 
American Psychological Association 























Technical Recommendations 


for Psychological Tests and Diagnostic Techniques 


Prepared by a joint committee of the American Psychological 
Association, American Educational Research Association, and 
National Council on Measurements Used in Education. 


PUBLISHED BY THE 
AMERICAN PSYCHOLOGICAL ASSOCIATION, INC. 
1333 Sixteenth Street N.W., Washington 6, D. C. 


Entered as second class mail matter at the post office at Washington, D.C., under the act of March 3, 1879 
Additional entry at the post office at Menasha, Wisconsin. Acceptance for mailing at special rate of postage
under the provisions of Sec. 34-40 Par. (D) provided for in Section 538, act of February 25, 1925, authorized
August 6, 1947. Printed in U.S.A 


Copyright 1954 by the American Psychological Association, Inc. 





Foreword 


This statement has been endorsed by the respective governing bodies 
of the American Psychological Association, the American Educational 
Research Association, and the National Council on Measurements Used
in Education. The original drafts were developed in the APA Com- 
mittee on Test Standards, whose members were Edward S. Bordin, 
R. C. Challman, H. S. Conrad, Lloyd G. Humphreys, Paul E. Meehl, 
Donald E. Super, and Lee J. Cronbach, chairman. The work was modi- 
fied and extended in cooperation with the AERA committee (Jacob S. 
Orleans, chairman, Saul B. Sells, and J. R. Gerberich) together with 
three liaison members (Conrad, Cronbach, and Super) and the NCMUE 
committee, whose successive chairmen have been Gerberich, Henry 
Rinsland, and Robert L. Ebel. An extension of the recommendations to 
cover additional problems related to achievement tests is in preparation. 


The statements presented here were submitted for criticism by special- 
ists in test construction and use, including test publishers, and a pre- 
liminary version was published in the American Psychologist (Amer. 
Psychologist, 1952, 7, 461-475) for wider examination. The present 
statement is the result of successive revisions. 





Development and Scope of the Recommendations 


Psychological and educational tests 
are used in arriving at decisions 
which may have great influence on 
the ultimate welfare of the persons 
tested, and of the community. Test 
users, therefore, wish to apply high 
standards of professional judgment 
in selecting and interpreting tests, 
and test producers wish to produce 
tests which can be of the greatest pos- 
sible service. The test producer, in 
particular, has the task of providing 
sufficient information about each 
test so that users will know what re- 
liance can safely be placed on it. 

Professional workers agree that 
test manuals and associated aids to 
test usage should be made complete, 
comprehensible, and unambiguous, 
and for this reason there have always 
been informal “test standards.” Publishers
and authors of tests have
adopted standards for themselves, 
and standards have been stated in 
textbooks and other publications. 
Through application of these stand- 
ards, tests have attained a high de- 
gree of quality and usefulness. 

Until this time, however, there has 
been no statement representing a con- 
sensus as to what information is most 
helpful to the test consumer. In the 
absence of such a guide, it is inevi- 
table that some tests appear with less 
adequate supporting information 
than others of the same type, and 
that facts about a test which some 
users regard as indispensable have not 
been reported because they seemed 
relatively unimportant to the test 
producer. This report is the outcome 
of an attempt to survey the possible 
types of information that test producers
might make available, to
weigh the importance of these, and to 
make recommendations regarding 
test preparation and publication. 

Improvement of testing has long 
been a concern of professional work- 
ers. In 1906, an APA committee, 
with Angell as chairman, was ap- 
pointed to act as a general control 
committee on the subject of measure- 
ments. The purpose of their work 
was to standardize testing techniques, 
whereas the present effort is con- 
cerned with standards of reporting in- 
formation about tests. 

In a developing field, it is necessary 
to make sure that standardizing 
efforts do not stifle growth. The 
words of the earlier committee are ap- 
propriate today: 

The efforts of a standardizing committee 
are likely to be regarded with disfavor and 
apprehension in many quarters, on the ground 
that the time is not yet ripe for stereotyping 
either the test material or the procedure. It 
may be felt that what is called for, in the 
present immature condition of individual psy- 
chology, is rather the free invention and the 
appearance of as many variants as possible.
Let very many tests be tried, each new inves- 
tigator introducing his own modification; and 
then, the worthless will gradually be elim- 
inated and the fittest will survive. 


Issuing specifications for tests 
could indeed discourage the develop- 
ment of new types of tests. So many 
different sorts of tests are needed in 
present psychological practice that 
limiting the kind or the specifications 
would not be sound procedure. Ap- 
propriate standardization of tests and 
manuals, however, need not interfere 
with innovation. The recommendations
presented here are intended to
assist test producers to bring out a
wide variety of tests that will be suit- 
able for all the different purposes for 
which tests should be used and to 
make those tests as valuable as pos- 
sible. 


Information Standards as a Guide 
to Producers and Users of Tests 


The essential principle that sets the 
tone for this document is that a test 
manual should carry information suf- 
ficient to enable any qualified user to 
make sound judgments regarding the 
usefulness and interpretation of the 
test. This means that certain research 
is required prior to release of a test 
for general use by psychologists or 
school personnel. The results must 
be reported or summarized in the 
manual, and the manual must help 
the reader to interpret these results. 

A manual is to be judged not 
merely by its literal truthfulness, but
by the impression it leaves with the
reader. If the typical professional
user is likely to obtain an inaccurate 
impression of the test from the man- 
ual, the manual is poorly written. 
Ideally, manuals would be tested in 
the field by comparing the typical 
reader's conclusions with the judg- 
ment of experts regarding the test. 
In the absence of such trials, our rec- 
ommendations are intended to apply 
to the spirit and tone of the manual 
as well as its literal statements. 

A manual must often communicate 
information to many different groups. 
Many tests are used by classroom 
teachers or psychometrists with very 
limited training in testing. These 
users will not follow technical discus- 
sion or statistical information. At the 
other extreme of the group of readers, 
the available information about any
test should be sufficiently complete
for specialists in the area to judge the 
technical adequacy of the test. Some- 
times the more technical information 
can be presented in a supplementary 
handbook, but it is most important 
that there be made available to the 
person concerned with the test a 
sound basis for whatever judgments 
his duties require. 

The setting of numerical specifica- 
tions has been avoided, even though 
it would have been tempting to say, 
for instance, that a validity coeffi- 
cient ought to reach .50 before a test 
of Type A is ready for use or that a 
test of Type B should always have a 
reliability of .90 before it is used for 
the measurement of individual sub- 
jects. There are different problems in 
different situations, depending, for 
instance, on whether clinical analysis 
or personnel selection is involved, or 
whether preliminary or final deci- 
sions are being made. It is not ap- 
propriate to call for a particular level 
of validity and reliability, or to other- 
wise specify the nature of the test. It 
is appropriate to ask that the manual
give the information necessary for the 
user to decide whether the accuracy, 
relevance, or standardization of the 
test makes it suitable for his purposes. 
These recommendations, then, sug- 
gest standards of test description and 
reporting without stating minimum 
statistical specifications. 

The aim of the present standards is 
partly to make the requirements as to 
information accompanying published 
tests explicit and conveniently avail- 
able. In arriving at those require- 
ments, it has been necessary to judge 
what is presently the reasonable de- 
gree of compromise between pressures 
of cost and time, on the one hand, and
the ideal, on the other. The test producer
ordinarily spends large sums of
money in developing and standardizing
a test. Insofar as these recommendations
indicate the sort of
information that would be most 
valuable to the people who use tests, 
test authors and publishers can then 
direct their funds to gathering and 
reporting those data. Validation on 
job criteria, for example, is essential 
before a vocational interest inven- 
tory can be used practically, but only 
a desirable addition for a values in- 
ventory, and irrelevant for an inven- 
tory designed to diagnose mental dis- 
orders. The recommendations there- 
fore attempt to state what type of 
studies should be completed before a 
test is ready for release to the pro- 
fession for operational use. The 
recommendations attempt to describe 
standards which are already reached 
by our better tests. 


Tests to Which 
the Recommendations Apply 

These recommendations cover not 
only tests as narrowly defined, but 
also most published devices for diagnosis
and evaluation. The recommendations
apply to interest inventories,
personality inventories,
projective instruments and related
clinical techniques, tests of aptitude 
or ability, and achievement tests. 
The same general types of informa- 
tion are needed for all these varieties 
of tests. General recommendations 
have been prepared with all these 
techniques and instruments in mind. 
Since each type of test presents cer- 
tain special requirements, additional 
comments have been made to indi- 
cate specific applications of the 
recommendations to particular techniques.
Many principles of specific
importance in measurement of 
achievement remain to be worked out 
in a subsequent statement. 

Tests can be arranged according to 
degree of development. The highest 
degree of development is needed for 
tests distributed for use in practical 
situations where the user is unlikely 
to validate the tests for himself. 
Such a user must assume that the 
test does measure what it is presumed 
to measure on the basis of its title and 
manual. For instance, if a clerical 
aptitude measure is used in voca- 
tional guidance under the assumption 
that this will predict success in office 
jobs, there is very little possibility 
that the counselor could himself vali- 
date the test for the wide range of 
office jobs to which his clients might 
go. 

At the other extreme of the contin- 
uum are tests in the very beginning 
stages of their development. At this 
point, perhaps the investigator is not 
sure whether his test is measuring any 
useful variable. Sometimes, because 
the theory for interpreting the test is 
undeveloped, the author restricts use 
of the test to situations where he him- 
self knows the persons who will use 
the test, can personally caution them 
as to its limitations, and is using the
research from these trials as a way of
improving the test.

Between these tests which are, so
to speak, embryonic, and the tests
which are released for practical application
without local validation, are
tests released for somewhat restricted
use. There are many tests which have
been examined sufficiently to indicate
that they will probably be useful
tools for psychologists, but which are
released with the expectation that the
user will conduct validation studies
against performance criteria, or will
verify suggested clinical interpreta- 
tions by studying the subsequent be- 
havior of persons in treatment. Ex- 
amples are certain tests of spatial 
ability, and some inventories measur- 
ing such traits as introversion. 

The present recommendations apply 
to devices which are distributed for use 
as a basis for practical judgments 
rather than solely for research. Most 
tests which are made available for use 
in schools, clinics, and industry are of 
this practical nature. Tests released 
for operational use should be pre- 
pared with the greatest care. They 
should be released to the general user 
only after their developer has 
gathered information which will per- 
mit the user to know for what use the 
test can be trusted. These state- 
ments regarding recommended infor- 
mation apply with especial force to 
tests distributed to users who have 
only that information about the test 
which is provided in the manual and 
other accessories. In the preparation 
of the recommendations, no attention 
was paid to tests which are privately 
distributed and circulated only to 
specially trained users. The recom- 
mendations also do not apply to tests 
presented in journal articles unless 
the article is intended to fulfill the 
functions of a manual. 

A brief discussion of problems of 
projective techniques is needed here 
because of the opinion occasionally 
voiced that these devices are so unlike 
other testing procedures that they 
cannot be judged according to the 
same standards. 

Many users of projective devices 
aim at idiographic analysis of an individual.
Since this kind of analytical
thinking places heavy reliance on the
creative, artistic activity of the cli- 
nician, not all of this process can be 
covered in test standards. Thus, the 
recommendations herein presented
are necessarily of a psychometric na- 
ture and should not be interpreted as 
meaning that projective techniques 
are intended primarily for such use. 
Nevertheless, proposals for arriving 
at such unique idiographic interpre- 
tations are almost always partially 
based upon some nomothetic prem- 
ises, e.g., that a Rorschach determi- 
nant tends to correlate with a speci- 
fied internal factor. There is no 
justification for failure to apply the 
usual standards in connection with 
these premises. Therefore, although 
these devices present unusual prob- 
lems, the user of projective tech- 
niques requires much of the same 
information that is needed by users of 
other tests.

Even though the data from projec- 
tive tests are more often qualitative 
than quantitative, these devices 
should be accompanied by appropri- 
ate evidence on validity, reliability, 
and so on. A projective test author 
need not identify his test’s validity 
by correlating it with any simple cri- 
terion. But if he goes so far as to 
make any generalization about what 
“most people see” or what “schizophrenics
rarely do,” he is making an
out-and-out statistical claim and 
should be held to the usual rules for 
backing it up. Obviously, when 
quantitative information is asked for 
in the recommendations, it is ex- 
pected to apply where a quantitative 
kind of claim has been made. If a 
projective test makes no such claim, 
a recommendation would not be 
meaningful for it. 







On the other hand, clinicians sometimes
forget that the words “more,”
“usual,” “typical,” and the like are 
quantity words. Any textual dis- 
course containing such words, or any 
verbal statement describing a corre- 
spondence between test performance 
and personality structure is making a 
quantitative claim. The only differ- 
ence between such a verbal statement 
and a statistical table is the relative 
exactness of the latter. For this rea- 
son, many of the recommendations 
apply to aspects of projective instru- 
ments for which verbal rather than 
numerical interpretations are sug- 
gested. 

The general topics to be covered in 
the recommendations are Dissemina- 
tion of Information, Interpretation, 
Validity, Reliability, Administration, 
and Scales and Norms. 


Many comments have been made 
to amplify and illustrate the recommendations.
Tests mentioned in the
comments have not been singled out 
as being particularly good or poor 
tests. The tests used for illustrative 
purposes were chosen because they 
are widely known, except where some 
less prominent test provides an un- 
usually clear illustration of the point 
under discussion. These references to 
tests are not intended as critical evalu- 
ations of the test as a whole and 
should not be quoted or referred to 
in test advertising. 


Three Levels of Recommendations 


Manuals can never give all the in- 
formation that might be desirable, 
because of economic limitations. At 
the same time, restricting this state- 
ment of recommendations to essential 
information might tend to discourage 
reporting of additional information.
To avoid this, recommendations are
grouped in three levels: ESSENTIAL, 
VERY DESIRABLE, and DESIRABLE. 
Each proposed requirement is judged 
in the light of its importance and the 
feasibility of attaining it. 

The statements listed as ESSENTIAL 
are intended to be the consensus of 
present-day thinking as to what is 
normally required for operational use 
of a test. Any test presents some 
unique problems, and it is undesir- 
able that standards should bind the 
producer of a novel test to an inappropriate
procedure or form of
reporting. The ESSENTIAL standards 
indicate what information will be 
genuinely needed for most tests in 
their usual applications. When a test 
producer fails to satisfy this need, he 
should do so only as a considered 
judgment. In any single test, there
will be very few ESSENTIAL standards 
which do not apply. 

If some type of ESSENTIAL information
is not available on a given
test, it is important to help the
reader recognize that the research
on the test is incomplete in this
respect. A test manual can satisfy
all the ESSENTIAL standards by clear
statements of what research has
and has not been done and by avoidance
of misleading statements. It
will not be necessary to perform much
additional research to satisfy the
standards, but only to discuss the
test so that the reader fully understands
what is known (and unknown)
about it.

The category VERY DESIRABLE is
used to draw attention to types of information
which contribute greatly
to the user’s understanding of the
test. They have not been listed as
ESSENTIAL for various reasons. For
example, if it is very difficult to acquire
information (e.g., long-term
follow-up), it cannot always be ex- 
pected to accompany the test. At 
times a closely reasoned minority 
opinion regards a type of informa- 
tion as unimportant. Such informa- 
tion is still very desirable, since 
many users wish it, but it is not 
classed as ESSENTIAL so long as its 
usefulness is debated. 

The category DESIRABLE includes 
information which would be helpful, 
but less so than the ESSENTIAL and 
VERY DESIRABLE information. Test 
users welcome any information of 
this type the producer offers. 

When a test is widely used, the 
producer has a greater responsibility 
for investigating it thoroughly and 
providing more extensive reports. 
The larger sale of such tests makes 
such research financially possible.
Therefore the producer of a popular
test can add more of the VERY
DESIRABLE and DESIRABLE informa- 
tion in subsequent editions of the 
manual. For tests having limited 
sale, it is unreasonable to expect that 
as much of these two categories of 
information will be furnished. In 
making such facts available, the pro- 
ducer performs a service beyond the 
level that can reasonably be antici- 
pated for most tests at this time. 


The Audience for These 
Recommendations 

These recommendations are in- 
tended to guide test development and 
reporting. A good deal of the in- 
formation to be reported about tests 
is technical, and therefore the word- 
ing of the recommendations is of 
necessity technical. They should 
be meaningful to readers who have 


(206) 


had a minimum of one substantial 
course in tests and measurements. 

One audience for the recommenda- 
tions is the authors and publishers 
who are responsible for test develop- 
ment. The recommendations should 
also aid the thinking of test users 
working either in psychology or edu- 
cation. It is not expected that the 
classroom teacher who has not had a 
course in tests and measurements will 
himself use this report. The report 
should, however, be helpful to direc- 
tors of research, school psychologists, 
counselors, supervisors, and admin- 
istrators who select tests to use for 
various school purposes. 

As an aid to test development, the 
recommendations provide a kind of 
check list of factors to consider in 
designing standardization and validation
studies. Test authors should
refer to them in deciding what studies 
to perform on their tests and how to 
report them in their manuals. Test
publishers will be able to use them in 
planning revision of their present 
tests. In considering proposed man- 
uals, publishers can suggest to au- 
thors the types of information which 
need to be gathered in order to make 
the manual as serviceable as it
should be. Because of the ease with 
which such claims could be misin- 
terpreted, it would not be appropriate 
to state in a test manual that it 
“satisfies” or “follows” these Technical
Recommendations. There would
be no such objection to a statement 
that an author had “attempted to 
take into account or considered” 
these recommendations in preparing 
the manual. 

Almost any test can be useful for 
some functions and in some situations.
But even the best test can
have damaging consequences if used
inappropriately. Therefore, ultimate 
responsibility for improvement of 
testing rests on the shoulders of test 
users. These recommendations should 
serve to extend the professional train- 
ing of these users so that they will 
make better use of the information 
about tests and the tests themselves. 
The recommendations draw atten- 
tion to recent developments in think- 
ing about tests and test analysis. The 
report should serve as a reminder re- 
garding features to be considered in 
choosing tests for a particular pro- 
gram. 

Professional thinking about tests is 
much influenced by test reviews, 
textbooks on testing, and courses in
measurement. These recommendations
may be helpful in improving
such aids, for instance, by suggesting
features especially significant to examine
in a test review. The recommendations
can be a teaching aid in
measurement courses. It is im- 
portant to note that publication of 
superior information about tests by 
no means guarantees that tests will 
be used well. The continual improve- 
ment of courses which prepare test 
users and of leadership in all institu- 
tions using tests is a responsibility in 
which everyone must share. 




Revision and Extension 


For many reasons, it will be neces- 
sary to revise the recommendations 
periodically.¹ Despite the care with
which the standards have been de- 
veloped, experience will no doubt re- 
veal that some of our judgments 
would benefit from further examina- 
tion. New tests will present problems 
not considered in the present work. 
The improvement of statistical tech- 
niques and psychometric theory will 
yield better bases for test analysis.
The efforts of test producers will lead 
to continued improvement in tests, 
and as this continues it will be pos- 
sible to raise the standards so that 
the test user will have ever better 
information about his tools. 

The recommendations here pre- 
sented are intended to be used with- 
out reference to any enforcement 
machinery. The statement will be 
used by individual members of the 
professions to improve their own 
work.


¹ Continuing committees of the three associations
are being established to receive comments
on the recommendations and to plan
appropriate revision. The 1953-1954 APA 
Committee on Test Standards consists of 
Edward S. Bordin, chairman, Paul E. Meehl, 
David Tiedeman, Jacob S. Orleans, ex officio, 
and R. L. Ebel, ex officio. 





The Recommendations 


A. Dissemination of Information 


The test user needs information to 
help him select the test which is most 
adequate for a given purpose. He 
must rely in large part on the test 
producer for such data. The practices
in furnishing the needed information
have varied. In the case
of some tests, the user has had access 
to virtually nothing beyond direc- 
tions for administering and scoring 
the test, and norms of uncertain 
origin. On the other hand, other 
tests have manuals which furnish 
extensive data on the development 
of the test, its validity and reliability, 
the origin of the norms, the kinds of 
interpretations which are appropriate,
and the uses for which it can
be employed. The diversity of practice
in making information about
tests available suggests the need for 
standards for the dissemination of 
information. 

A 1. When a test is published for 
operational use, it should be accom- 
panied by a manual which takes cog- 
nizance of the detailed recommenda- 
tions in this report. ESSENTIAL 


[Comment: Sometimes information 
needed to support interpretations sug- 
gested in the manual cannot be presented 
at the time the manual is published. The 
manual satisfies the intent of recom- 
mendation A 1 if it points out the ab- 
sence and importance of this information. 

It should be recognized that a recom- 
mendation may not apply to a particular 
test. The manual writer “takes cognizance
of” the recommendation if he
examines it with care to make certain 
whether it has implications for his test. 
It is not proper to ignore a recommendation
merely because the recommendation,
while applicable to claims made for
the test, is difficult to meet or has ordi- 
narily not been met by similar tests.] 


A 1.1 Some form of manual, pre- 
senting at least minimum informa- 
tion, should be given or sold to all 
purchasers of the test. ESSENTIAL 

A 1.2 Where the information is 
too extensive to be fully reported in 
such a manual, the manual should 
summarize the ESSENTIAL informa- 
tion and indicate where further de- 
tails may be found. ESSENTIAL 


[Comment: The Differential Aptitude 
Tests provide an extensive manual, and 
also make further research data avail- 
able through the American Documenta- 
tion Institute. A great deal of the in- 
formation about the Stanford-Binet is 
included in a book which all users must 
have. The Strong Vocational Interest 
Blank has been the subject of unusually 
thorough research which is reported in a 
technical book; a brief version of the 
ESSENTIAL information is given in a 
manual sold with the blanks. 

For many projective techniques, such 
as the Rorschach and TAT, publications 
by persons other than the test author 
fulfill many functions of a manual. In- 
sofar as a book about a technique fulfills 
the functions of a manual, the author has 
the same responsibility in preparing it as 
does the original author of the test.] 


A 1.3 If information about the 
test is provided in a separate pub- 
lication, any such publication should 
meet the same standards of accuracy 
as apply to the manual. ESSENTIAL 


[Comment: A report in a professional 
journal, for instance, on the validity of an
instrument should meet the same standards
of completeness and freedom from
misleading impressions as a report in the 
manual. Recommendation A 1.3 ap- 
plies also to advertising literature.] 


A 2. The manual should be up-to- 
date. It should be revised at appro- 
priate intervals. ESSENTIAL 


[Comment: As criteria change, the
predictive validity of a test may be
altered. Also, the norms may require
revision. Thus, a change in school objectives
which places increased emphasis
on problem solving in algebra, rather
than on factoring and other mechanics,
could appreciably affect the validity of an
algebra aptitude test. It would also alter
the norms for an algebra achievement
test.]


A 2.1 When new information
emerges, from investigations by the 
test author or others, which indicates 
that some facts and recommenda- 
tions presented in the manual are 
substantially incorrect, a revised 
manual should be issued at the 
earliest feasible date. ESSENTIAL 

[Comment: A revised manual for the 
Army Beta which arose out of World War 
I was issued in 1946. In contrast, al- 
though extensive published research 
points out the need for altering state- 
ments made in the manual of the Bern- 
reuter Personality Inventory, no revised 
edition of that manual has been prepared. 
Likewise, the 1943 manual for the TAT 
has not been revised despite extensive 
development in the field since that date.] 


A 2.2 When a test is revised or a 
new form is prepared, the manual 
should be thoroughly revised to take 
changes in the test into account.
ESSENTIAL


[Comment: The Wechsler-Bellevue
Scale was modified in several respects
in the third edition of the manual. For
example, the directions and scoring pro- 
cedure were altered. The norms should 
have been reviewed or redetermined. In- 
stead, the earlier tables for converting 
scores to IQ were carried over, without 
change, to the new edition.] 


A 2.21 When a short form of a 
test is prepared by reducing the num- 
ber of items or organizing a portion of 
the test into a separate form, new 
evidence should be obtained and re- 
ported for that new form of the test. 
VERY DESIRABLE 


[Comment: This is especially im- 
portant for inventories, where placing 
items in a new context might alter the 
person's responses. For example, the 
MMPI properly retains some items 
which were not scored in any key, because 
removing those items might alter the 
discriminating power of the items which 
were scored.]


A 2.22 When a short form is pre- 
pared from a test, the manual should 
present the correlation between the 
long and short forms, separately
administered. DESIRABLE
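
The correlation called for in A 2.22 can be illustrated with a short computation. The following is only a sketch: it assumes the statistic reported is the Pearson product-moment coefficient (the recommendation does not name one), and the function name `pearson_r` and both score lists are hypothetical, invented for illustration rather than taken from any test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each examinee's score from the mean of that form.
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one form has no score variance")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six examinees who took each form at a separate sitting.
long_form = [52, 61, 47, 70, 58, 66]
short_form = [20, 25, 18, 29, 22, 27]

r = pearson_r(long_form, short_form)
print(f"long-short correlation: {r:.3f}")
```

Each pair of scores must come from the same examinee taking the two forms on separate occasions. A short form scored out of a single administration of the long form shares its items and errors with the long form, so a coefficient obtained that way would not be the one the recommendation describes.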


A 2.3 The copyright date of the 
manual or the date of the latest re- 
vision should be clearly indicated. 
ESSENTIAL 


B. Interpretation 


In interpreting tests, the user al- 
ways is responsible for making in- 
ferences as to the meaning and 
legitimate uses of test results. In 
making such judgments, he is de- 
pendent upon the available data 
about the test. 

The degree to which a test manual 
can be expected to prepare the user 
for accurate interpretation and effective
use of the test varies with the
type of test and the purpose for
which it is used. For any test, it is 
sometimes necessary to make judg- 
ments which have not been sub- 
stantiated by the published evidence. 
Thus the vocational counselor can- 
not expect to have regression equa- 
tions available for predictions he 
must make from test scores, and the 
clinician must interpret a personality 
inventory on the basis of general 
data and theory because research on 
any one instrument is incomplete. 
The manual of a projective test can- 
not fully prepare the user for in- 
terpretation. Test users should be 
wary of interpreting projective test 
results without supervised training 
with that device and instruction in 
the clinical concepts and data which 
are part of its background. 

This problem of accuracy is not 
the only consideration related to test 
interpretation. An equally important 
concern is the examinee’s reactions to 
interpretations of his test scores, if 
the interpretation is made to him. 
Many educational and clinical uses 
of tests require reporting the in- 
terpretations to the person tested. 
The teacher who interprets the re- 
sults of academic achievement tests 
affects the student’s self concept and 
future learning. The clinician, in 
making interpretations which bear 
upon the client’s areas of conflict, 
may unwittingly intensify those con- 
flicts. 

B 1. Insofar as possible, the test, 
the manual, record forms, and other 
accompanying material should assist 
users to make correct interpretations 
of the test results. ESSENTIAL 

B 1.1 Names given to tests, and 
to scores within tests, should be 
chosen to minimize the risk of mis-
interpretation by test purchasers and
subjects. ESSENTIAL 


[Comment: The Army General Clas-
sification Test, the Blacky Test, and the
Draw-A-Person Test are examples of
names based on the content or process
involved in the test which carry no un-
warranted suggestions as to character-
istics measured. Such names as “culture-
free test,” “primary abilities test,”
“measure of mental growth,” and “tem-
perament test” are likely to suggest in-
terpretations going beyond the demon-
strable meaning of test scores.

Names designed to disguise the pur- 
pose of a test from a subject may prop- 
erly be used. In such a case, the manual 
should contain in an early and con- 
spicuous place an explanation of the 
reason for choosing this name and a 
statement of what in fact the test is sup- 
posed to measure. 

B 1.1 and subordinate recommenda- 
tions can be followed in developing new 
tests, but it will rarely be feasible to re- 
name established tests, even when this 
would be desirable.] 


B 1.11 Interest and personality in-
dices based on the self-report prin-
ciple should be called “inventories,”
“questionnaires,” or the like, rather
than “tests.” ESSENTIAL

B 1.2 The manual or other ac- 
companying material should describe 
the process by which interpretations 
are to be derived from test scores. 
VERY DESIRABLE 


[Comment: The manual need not in- 
clude such information as all profes- 
sionally qualified users may be expected 
to have. The original manual for the 
Differential Aptitude Tests presented a
few profiles and gave an interpretation 
and a too brief case summary for each 
one. Later, more extensive case reports 
were reported in Counseling from Pro- 
files, a supplementary booklet on the 
test, and the sketchy profiles were re-
moved from the manual. The case re-
ports avoid oversimplification and em-
phasize the possible influence of non-
test data on test interpretation.

The Atlas for the MMPI makes avail- 
able for study examples of a variety of 
complex personality profiles. Few other 
personality inventories are supplemented 
by such materials as aids in their in- 
terpretation.] 


B 1.21 The manual should draw 
the user’s attention to data other 
than the test scores which need to be 
taken into account in interpreting the 
test. VERY DESIRABLE 


[Comment: For example, Murray's
TAT manual states that “the psycholo-
gist should know the following basic
facts: the sex and age of the subject,
whether his parents are dead or sepa-
rated, the ages and sexes of his siblings,
his vocational and his marital status.”]


B 1.22 When case studies are used 
as illustrations for the interpretations 
of test scores, the examples presented 
should include some relatively com- 
plicated cases whose interpretation is 
not clear-cut. VERY DESIRABLE 

B 1.23 Where a certain misinter- 
pretation of a given test is known to 
be frequently made (or can reason- 
ably be anticipated in the case of a 
new test), the manual should draw 
attention to this error and warn
against it. ESSENTIAL


[Comment: Since the Terman-
McNemar Test of Mental Ability reports
scores in terms of a deviation IQ rather 
than a ratio IQ, it discusses at some 
length the fact that deviation IQ's do not 
have the same properties as ratio IQ's. 
Complete avoidance of the term IQ for 
deviation scores would be a more certain 
way to avoid confusion. 

Another common misconception is 
that intelligence tests are measures of 
inherent native ability alone; it would be
desirable for manuals of such tests to
caution against this interpretation. 
Manuals for interest measures should 
make clear, and urge counselors to stress 
to the client, the fact that interest does 
not imply ability and is only one factor 
to be considered in choosing among
occupations. A desirable caution of this
type is found in the Lee-Thorpe Occupa-
tional Interest Inventory.]
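
The distinction between ratio and deviation IQ's mentioned in
the comment above can be illustrated with a brief sketch. The
formulas are the conventional textbook ones and are not taken
from any particular manual. The function names and the scale
standard deviation of 15 are assumptions made for illustration;
some tests use a different value.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: 100 times mental age divided by chronological age."""
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return 100.0 * mental_age / chronological_age


def deviation_iq(raw_score: float, age_group_mean: float,
                 age_group_sd: float, scale_sd: float = 15.0) -> float:
    """Deviation IQ: a standard score placed on a scale with mean 100.

    scale_sd is an assumed value; the actual figure depends on the test.
    """
    if age_group_sd <= 0:
        raise ValueError("age_group_sd must be positive")
    z = (raw_score - age_group_mean) / age_group_sd
    return 100.0 + scale_sd * z


if __name__ == "__main__":
    # Hypothetical examinee: mental age 12, chronological age 10.
    print(ratio_iq(12, 10))
    # Hypothetical raw score one standard deviation above the age-group mean.
    print(deviation_iq(60, 50, 10))
```

The two indices rest on different information: the ratio IQ
depends on an age scale, while the deviation IQ depends on the
spread of scores within the examinee's own age group, so equal
numerical values need not carry the same meaning.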


B 2. The test manual should state 
explicitly the purposes and applica- 
tions for which the test is recom- 
mended. ESSENTIAL 

B 2.1 If a test is intended for re- 
search use only, and is not dis- 
tributed for operational use, that 
fact should be prominently stated 
in the accompanying materials.
ESSENTIAL


[Comment: If, for example, an investi- 
gator plans to release tests developed by 
factor analysis for research use, it would 
be appropriate to print ‘distributed for 
research use only” on the test package or 
cover of the booklet of directions. This 
would serve to caution against premature 
use of the tests in guidance.] 


B 3. The test manual should indi- 
cate the professional qualifications 
required to administer and interpret 
the test properly. ESSENTIAL

B 3.1 Where a test is recom-
mended for a variety of purposes or
types of inference, the manual should
indicate the amount of training re-
quired for each use. ESSENTIAL

[Comment: One suggested categoriza-
tion of tests approved by the APA is as
follows (see APA Code of Standards for
Test Distribution, American Psychologist,
November, 1950; this statement also
includes descriptions of general levels
of training which correspond to the
three levels of tests):

Level A. Tests or aids which can ade-
quately be administered, scored, and in-
terpreted with the aid of the manual and
a general orientation to the kind of or-
ganization in which one is working.
(E.g., achievement or proficiency tests.)

Level B. Tests or aids which require 
some technical knowledge of test con- 
struction and use, and of supporting psy- 
chological and educational subjects such 
as statistics, individual differences, and 
psychology of adjustment, personnel psy- 
chology, and guidance. (E.g., aptitude 
tests, adjustment inventories with nor- 
mal populations.) 

Level C. Tests and aids which require 
substantial understanding of testing and 
supporting psychological subjects, to- 
gether with supervised experience in the 
use of these devices. (E.g., projective 
tests, individual mental tests.) 

The manual might identify a test ac- 
cording to one of the foregoing levels, or 
might employ some form of statement 
more suitable for that test. Regarding a 
particular industrial personnel test, the 
manual might say: “This test can be ad-
ministered and scored by an intelligent
clerical employee, but decisions regarding 
hiring and related interpretations should 
be made only by a psychologist or per- 
sonnel manager who has studied funda- 
mental statistics including correlation. 
Only a vocational counselor with special- 
ized graduate training should use the 
test for vocational guidance.”]


B 3.11 The manual should not 
imply that the test is “self-interpret-
ing,” or that it may be interpreted
by a person lacking proper training. 
ESSENTIAL 

B 3.12 The manual should point 
out the counseling responsibilities as- 
sumed when a tester communicates 
interpretations about ability or per- 
sonality traits to the person tested. 
ESSENTIAL 


[Comment: While examinees may prop-
erly score their own interest inventories
and examine their own profiles, the
Manual for the Kuder Preference Rec- 
ord properly recommends that they 
should make interpretations and future 
plans only with professional help in in-
dividual or group counseling situations.]


B 3.2 The manual should draw 
attention to references dealing with 
the test in question with which the 
user should become familiar before 
attempting to interpret the test. 
The statement should avoid the im- 
plication that this constitutes the 
only training needed, if other training 
is required. VERY DESIRABLE 

B 4. When a test is issued in re- 
vised form, the nature and extent of 
any revision, and the comparability 
of data for the revised and the old 
test should be explicitly stated.
ESSENTIAL


[Comment: An example of desirable
practice is found in the manual for the 
revised edition of the Study of Values.] 


B 5. Statements in the manual re-
porting relationships are by implica-
tion quantitative, and should be 
stated as precisely as the data per- 
mit. If data to support such a state- 
ment have not been collected, that 
fact should be made clear. ESSENTIAL 


[Comment: Writers sometimes say, for 
example, “Spatial ability is required for 
architectural engineering” or, “Bizarre
responses often indicate schizophrenic
tendencies.” Such statements need to be
made more definite. In what proportion 
of cases giving bizarre responses does 
schizophrenia develop? How much does 
architectural success depend upon spatial 
ability? Numerical data would provide 
the needed answer.] 
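
As a sketch of the kind of numerical statement the comment
above asks for, the following computes the proportion of cases
showing a sign who also show the outcome, together with the
base rate in the whole group for comparison. The function
names are illustrative and the counts in the usage example are
hypothetical; no actual data on bizarre responses are implied.

```python
def proportion(part: int, whole: int) -> float:
    """Return part / whole after checking that the counts are sensible."""
    if whole <= 0 or not 0 <= part <= whole:
        raise ValueError("counts must satisfy 0 <= part <= whole, whole > 0")
    return part / whole


def sign_summary(sign_and_outcome: int, sign_only: int,
                 outcome_only: int, neither: int) -> dict:
    """Summarize a fourfold table of sign (present/absent) by outcome."""
    with_sign = sign_and_outcome + sign_only
    total = with_sign + outcome_only + neither
    return {
        # Proportion of cases with the sign who show the outcome.
        "outcome_rate_given_sign": proportion(sign_and_outcome, with_sign),
        # Proportion of all cases who show the outcome.
        "base_rate": proportion(sign_and_outcome + outcome_only, total),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(sign_summary(sign_and_outcome=12, sign_only=28,
                       outcome_only=8, neither=152))
```

Reporting both figures lets the reader judge how much the sign
adds to what the base rate alone would indicate.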


B 5.1 When the term “significant”
is employed, the manual should make
clear whether statistical or practical
significance is meant, and the practi-
cal significance of statistically reliable
differences should be evaluated.
ESSENTIAL

B 5.2 The manual should clearly
differentiate between an interpreta-
tion justified regarding a group
taken as a whole, and the applica-
tion of such an interpretation to each
individual within the group.
ESSENTIAL


[Comment: For example, if the stand- 
ard error of measurement is five points, 
this statement should not be presented 
so as to imply that the obtained score for 
any one individual is within five points 
of his true score. For a single pupil, the 
difference between the obtained and true 
score might be very much larger.] 
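
To make the point of the preceding comment concrete, the
following sketch assumes that errors of measurement are
normally distributed with a standard deviation equal to the
standard error of measurement. That assumption is common in
test theory but is not stated in the text above. Under it,
roughly one examinee in three obtains a score more than one
standard error from his true score. The function name is
illustrative.

```python
import math


def proportion_beyond(k: float) -> float:
    """Proportion of examinees whose obtained score lies more than k
    standard errors of measurement from the true score, assuming
    normally distributed errors (an assumption, not a claim of the text)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # For a standard normal Z, P(|Z| > k) = erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"more than {k} SEM from true score: {proportion_beyond(k):.3f}")
```

With a standard error of five points, the first figure is the
proportion of examinees whose obtained scores miss their true
scores by more than five points.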


C. Validity 

Validity information indicates to 
the test user the degree to which the 
test is capable of achieving certain 
aims. Tests are used for several 
types of judgment, and for each type 
of judgment, a somewhat different 
type of validation is involved. We 
may distinguish four aims of testing: 

1. The test user wishes to deter- 
mine how an individual would per- 
form at present in a given universe of 
situations of which the test situation 
constitutes a sample. 

2. The test user wishes to predict 
an individual’s future performance 
(on the test or on some external vari- 
able). 

3. The test user wishes to estimate 
an individual’s present status on 
some variable external to the test. 

4. The test user wishes to infer the 
degree to which the individual pos- 
sesses some trait or quality (con- 
struct) presumed to be reflected in 
the test performance. 


Thus, a vocabulary test might be 
used simply as a measure of present 
vocabulary, as a predictor of college 
success, as a means of discriminating 
schizophrenics from organics, or as a 
means of making inferences about 
“intellectual capacity.” 


Four Types of Validity 


To determine how suitable a test 
is for each of these uses, it is neces- 
sary to gather the appropriate sort of 
validity information. These four 
aspects of validity may be named 
content validity, predictive validity, 
concurrent validity, and construct 
validity. 

a. Content validity is evaluated by 
showing how well the content of the 
test samples the class of situations or 
subject matter about which conclu- 
sions are to be drawn. Content 
validity is especially important in the 
case of achievement and proficiency 
measures. 

In most classes of situations meas- 
ured by tests, quantitative evidence 
of content validity is not feasible. 
However, the test producer should 
indicate the basis for claiming ade- 
quacy of sampling or representative- 
ness of the test content in relation to 
the universe of items adopted for 
reference. 

b. Predictive validity is evaluated 
by showing how well predictions 
made from the test are confirmed by 
evidence gathered at some subse- 
quent time. The most common 
means of checking predictive validity 
is correlating test scores with a subse- 
quent criterion measure. Predictive 
uses of tests include long-range pre- 
diction of intelligence measures, pre- 
diction of vocational success, and 
prediction of reaction to therapy. 
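
The correlational check described above can be sketched as
follows. The product-moment coefficient is one common choice
of index, not the only one, and the scores in the usage example
are hypothetical values invented for illustration.

```python
import math


def pearson_r(test_scores, criterion_scores):
    """Product-moment correlation between test scores and a criterion
    measure obtained at some later time for the same persons."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two equally long lists of at least 2 scores")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion_scores)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical data: test scores at entrance, grade averages a year later.
    test = [42, 55, 61, 48, 70, 66]
    grades = [2.1, 2.8, 3.0, 2.6, 3.4, 2.9]
    print(f"predictive validity coefficient: {pearson_r(test, grades):.2f}")
```

The coefficient describes the group studied; it does not by
itself say how closely any one person's criterion standing
can be predicted.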





c. Concurrent validity is evalu-
ated by showing how well test scores
correspond to measures of concurrent 
criterion performance or status. 
Studies which determine whether a 
test discriminates between presently 
identifiable groups are concerned 
with concurrent validity. Concurrent 
validity and predictive validity are 
quite similar save for the time at 
which the criterion is obtained. 
Among the problems for which con- 
current validation is used are the 
validation of psychiatric screening 
instruments against estimates of ad- 
justment made in a psychiatric in- 
terview, differentiation of vocational 
groups, and classification of patients. 
It should be noted that a test having 
concurrent validity may not have 
predictive validity. 

d. Construct validity is evaluated 
by investigating what psychological 
qualities a test measures, i.e., by 
demonstrating that certain explana- 
tory constructs account to some de- 
gree for performance on the test. To 
examine construct validity requires 
both logical and empirical attack. 
Essentially, in studies of construct 
validity we are validating the theory 
underlying the test. The validation 
procedure involves two steps. First, 
the investigator inquires: From this
theory, what predictions would we
make regarding the variation of
scores from person to person or oc-
casion to occasion? Second, he
gathers data to confirm these predic-
tions.

There are various specific pro-
cedures for gathering data on con-
struct validity. If it is supposed
that form perception on the Ror-
schach test indicates probable ability
to resist stress, this supposition may
be validated by placing individuals
in an experimental stress situation
and observing whether behavior cor- 
responds to prediction. Another 
much simpler procedure for investi- 
gating what a test measures is to 
correlate it with other measures; we 
would expect a valid test of numeri- 
cal reasoning, for example, to be sub- 
stantially correlated with other nu- 
merical tests, but not to be correlated 
with a clerical perception test. Fac- 
tor analysis is another way of or- 
ganizing data about construct valid- 
ity. 
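
The correlational procedure just described, in which a test is
expected to correlate with similar measures and not with
dissimilar ones, can be sketched as below. The cutoff values
0.5 and 0.3 are arbitrary assumptions chosen for illustration,
not standards proposed in this report, and the scores in the
usage example are hypothetical.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation of two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists of at least 2 scores")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def validity_pattern(target, similar, dissimilar, high=0.5, low=0.3):
    """Compare the target test with a similar and a dissimilar measure.

    high and low are illustrative cutoffs (assumptions, not standards).
    """
    r_similar = pearson_r(target, similar)
    r_dissimilar = pearson_r(target, dissimilar)
    return {
        "r_with_similar": r_similar,
        "r_with_dissimilar": r_dissimilar,
        "pattern_as_expected": r_similar >= high and abs(r_dissimilar) <= low,
    }


if __name__ == "__main__":
    # Hypothetical scores for five persons on three tests.
    numerical_reasoning = [1, 2, 3, 4, 5]
    other_numerical = [2, 3, 5, 4, 6]
    clerical_perception = [3, 1, 4, 1, 3]
    print(validity_pattern(numerical_reasoning, other_numerical,
                           clerical_perception))
```

A single pattern of this kind is only one piece of evidence;
the judgment of construct validity rests on many such checks
taken together.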

We can distinguish among the four 
types of validity by noting that each 
involves a different emphasis on the 
criterion. In predictive or concurrent 
validity, the criterion behavior is of 
concern to the tester, and he may 
have no concern whatsoever with the 
type of behavior exhibited in the 
test. (An employer does not care 
if a worker can manipulate blocks, 
but the score on the block test may 
predict something he cares about.) 
Content validity is studied when the 
tester is concerned with the type of
behavior involved in the test per- 
formance. Indeed, if the test is a 
work sample, the behavior repre- 
sented in the test may be an end in 
itself. Construct validity is ordi- 
narily studied when the tester has 
no definitive criterion measure of the 
quality with which he is concerned, 
and must use indirect measures to 
validate the theory. Here the trait 
or quality underlying the test is of 
central importance, rather than 
either the test behavior or the scores 
on the criteria. 

It is ordinarily necessary to evalu- 
ate construct validity by integrating 
evidence from many different sources. 
The problem of construct validation 
becomes especially acute in the clini-
cal field since for many of the con-
structs dealt with it is not a question
of finding an imperfect criterion but 
of finding any criterion at all. The 
psychologist interested in construct 
validity for clinical devices is con- 
cerned with making an estimate of a 
hypothetical internal process, factor, 
system, structure, or state and can- 
not expect to find a clear unitary be- 
havioral criterion. Concern for val- 
idity is in no way a challenge to 
the dictum that prediction of be- 
havior is the final test of any theoret- 
ical construction. But it is necessary 
to understand that behavior-relevance 
in a construct is not logically the 
same as behavior-equivalence. It is 
one thing to insist that in order to be 
admissible, a complex psychological 
construct must have some relevance 
to behavioral indicators; it is quite 
another thing to require that any 
admissible psychological construct 
must be equivalent to any direct oper- 
ational behavior measure. Any posi- 
tion that cuts the test inference off 
from all possible nontest sources of 
confirmation appears to be an un- 
reasonable one. If the test is to be 
interpreted in terms of internal con-
structs, there must be some facts,
quantitative or not, that would
argue for the existence of the partic-
ular internal system postulated. An
attempt to identify any one criterion 
measure or any composite as the 
criterion aimed at is, however, usu- 
ally unwarranted. 

This viewpoint, while fraught with 
grave dangers and sometimes mis- 
used, is nevertheless methodologi- 
cally sound. The clinician interested 
in construct validity has in mind an 
admittedly incomplete construct, the 
evidence for which is to be found 
roughly in such-and-such behavioral
domains. The vagueness of the
construct is an inevitable conse-
quence of the incompleteness of cur-
rent psychological theory, and can-
not be rectified faster than theory
grows and is confirmed. At a given 
stage of theoretical development, the 
only kind of prediction that can be 
made may be that certain correlations 
should be positive, or that patients 
who fail to conform to a group trend 
should be expected with considerable 
frequency to exhibit such-and-such 
an additional feature, or the like. It is
clear that these deductions do in- 
volve behavioral prediction. They 
require the test-constructs to be 
behaviorally relevant. But they still 
do not necessarily identify any of the 
test-inferred constructs or variables 
with any criterion measure. A 
clinician may say, “I expect to find
cases of psychosomatic ulcer show- 
ing large discrepancies between latent 
n Succorance as inferred from TAT 
stories and manifest n Succorance as 
revealed by the score on a question-
naire.” Such a declaration leads to
an empirical test. 

The correlation or measure of dis- 
crimination obtained in studying con- 
struct validity is not to be taken as 
the “validity coefficient,” in the
same sense that prediction of wash-
outs during flight training is the
validity coefficient for the battery
employed. Studies of many such
predictions, possibly involving quite 
independent components of theory, 
will in the mass confirm or disconfirm 
the claims made. 

One tends to ask regarding con-
struct validity just what is being
validated: the test or the underlying
hypothesis? The answer is, both,
simultaneously. If one predicts an
empirical relation by supposing a cer-
tain personality organization, the
verification of this prediction tends 
to confirm both the component sup- 
positions that gave rise to it. True, 
there might be plausible alternative 
hypotheses, but this is always the 
case in science. The more alterna- 
tives there are, the more cumulated 
evidence is needed to justify con-
fidence in the particular test-
hypothesis pair. A further charac-
teristic of this type of validity in-
ference is that the construct itself
undergoes modification as evidence 
accumulates. We do not merely 
alter our confidence in the correct- 
ness of the construct, or in the esti- 
mates of its magnitudes, but we 
actually reformulate or clarify our 
characterization of its nature on the 
basis of new data. 

It must be kept in mind that these
four aspects of validity are not
all discrete and that a complete
presentation about a test may in-
volve information about all types of
validity. A first step in the prepara-
tion of a predictive instrument may
be to consider what constructs or
predictive dimensions are likely to
give the best prediction. Examining
content validity may also be an early
step in producing a test whose pre-
dictive validity is ultimately of major
concern. Even after satisfactory pre-
dictive validity has been established,
information relative to construct
validity may make the test more use-
ful. To analyze construct validity,
our total background of knowledge
regarding validity would be brought
to bear.


Application of the Concepts to Ability
Tests 


Several examples of the applica-
tion of these principles to intelligence
tests should clarify the concepts in-
volved. Correlations between an
intelligence test used to select uni- 
versity students and later academic 
success are predictive validities. Such 
correlations will typically vary in 
size from those with criteria of pro- 
ficiency in art or music at the lower 
end to those with grades in science 
at the upper end. If the test is used 
to predict an art criterion, then the 
correlation obtained, even though 
low, is the predictive validity of the 
test. Even if validities of intelligence 
tests are corrected for attenuation, a 
value substantially less than unity 
is the usual result. This is not in-
terpreted, when predictive validity
is at issue, to mean that the cri-
terion is an imperfect index of
intelligence. Rather the test is
regarded as an imperfect index of
the criterion.
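
The correction for attenuation referred to above can be
sketched with the conventional formula, in which the observed
validity coefficient is divided by the square root of the
product of the reliabilities of test and criterion. The formula
is the standard one from test theory rather than one given in
this report, and the numerical values in the usage example are
hypothetical.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float = 1.0) -> float:
    """Estimate the correlation between test and criterion that would be
    found if both were perfectly reliable.

    r_xy: observed validity coefficient
    r_xx: reliability of the test
    r_yy: reliability of the criterion (1.0 leaves the criterion uncorrected)
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


if __name__ == "__main__":
    # Hypothetical values: observed validity .42, reliabilities .81 and .64.
    print(f"{correct_for_attenuation(0.42, 0.81, 0.64):.3f}")
```

Even after such a correction the estimate in this illustration
remains well below unity, which is the situation the paragraph
above describes.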
Relationships of subscores on an 
intelligence test to membership in 
various clinical groups are an ex- 
ample of evidence concerning con- 
current validity. Again, low or im- 
perfect validities are interpreted as 
due to inadequacies in the test as a 
discriminating device. A test is
likely to be developed for making dis- 
criminations if it is difficult to meas- 
ure status on a criterion directly. If 
the direct measurement of the cri- 
terion is expensive, dangerous, or 
highly unreliable, tests having con- 
current validity are needed to assess 
status on the criterion indirectly.

Content validity is indicated by a
description of the universe of items 
from which selection was made, 
including a description of the selec- 
tion process. The universe of items 
in intelligence test construction is 
usually defined by the types of items 
used originally by Binet. Judges’ 
ratings of appropriateness of items
are frequently involved. Content
validity is ordinarily of little direct 
interest to the user of intelligence 
tests. The distinction between verbal 
and nonverbal tests of intelligence is, 
however, based on content analysis. 

Construct validity may be judged 
from all of the information ordinarily 
subsumed under the preceding cate- 
gories. Certain types of information, 
however, are employed here alone. 
Examples are as follows: correlations 
with other tests of intelligence, cor- 
relations with ratings of intelligence, 
factor analyses, nature-nurture 
studies, and studies of the effects of 
practice upon test scores. All relate 
to the problem of the meaning of the 
concept of intelligence. From this 
point of view a low correlation of the 
test with athletic ability may be just 
as important and encouraging as a 
high correlation with reading com- 
prehension. This reverses the earlier 
emphasis from the viewpoint of con- 
current or predictive validity, where 
a low correlation indicated weakness 
in the test. 

Information concerning construct 
validity is of help to the theorist in 
formulating hypotheses concerning 
individual differences and to the test 
constructor in improving intelligence 
tests. For the practical test user this 
information is most frequently used 
to generalize beyond established pre- 
dictive and concurrent validities. 
The careful verification of theory 
should serve to reduce the errors of 
extrapolation, but does not reduce 
the necessity of objective check upon 
extrapolations whenever possible.


Application of the Concepts to Per-
sonality Inventories

Evidence of the predictive validity
of personality questionnaires pro-
vides the basis for their use for screen-
ing. One of the screening uses is to
identify persons who will become 
maladjusted (as in the armed serv- 
ices). If personality instruments are 
used as a basis for predicting voca- 
tional or educational achievement, 
this inference also rests upon predic- 
tive validity. 

Evidence of concurrent validity 
supports the use of personality ques- 
tionnaires for screening and diag- 
nostic purposes. An example is the 
use of check lists to determine which 
students are presently most in need 
of counseling. 

Interpretation of responses as self- 
description (e.g., by judging con- 
servatism from responses to a group 
of statements) represents one kind 
of assumption of content validity in 
the context of personality inventories. 

Construct validity is involved when 
the personality inventory is used to 
ascertain the personality traits or 
structure of the individual. 

Predictive or concurrent valida- 
tion of personality questionnaires can 
depend upon fairly clear-cut opera- 
tional criteria, e.g., reporting for sick 
call, membership in one occupational 
group as compared to another, psy-
chiatric classifications. On the other
hand, in validation of conceptual in-
ferences, problems arise because of
the lack of a simple relationship be-
tween personality traits and overt
behavior. The “retiring” person may 
not actually behave in an unsociable 
manner, but the social activities in 
which he engages may be less satis- 
fying to him, and participating in 
them may result in emotional stress 
which manifests itself in tics or other 
psychosomatic symptoms. This type 
of validity will not be judged by the 
size of any given relationship be-
tween a score and one criterion, but
by the pattern of relationships which 
have been demonstrated to hold be- 
tween a score and a number of differ- 
ent kinds of behavior criteria. 


Application of the Concepts to Interest
Inventories 


Most interest inventories are used
for predictive purposes. In counsel-
ing, scores are discussed in the con-
text of a consideration of educational
or vocational plans of the client.
Even if the counselor makes very
restricted interpretations (e.g., “the
number of preferences for mechanical
activities you reported is exceeded by
only 5 per cent of high school
seniors”), the context in which this
discussion occurs implies that this
information has some direct bearing
on future performance. The test is
really interpreted as indicating some-
thing about the client’s probable suc-
cess, satisfaction, or continuity in
some activity.

Description of the individual is a
second use of interest inventories.
Interests are described in terms of
categories or traits. In some de-
vices, the description of interests in-
volves such broad categories as to be
essentially a description of gen-
eralized personality traits. This in-
volves content validity, and often
construct validity.

Different inferences must be sup-
ported by different types of evidence.
In general, since counseling involves
consideration of a very large number
of vocations, it is not expected that
every judgment for which an interest
inventory is used will be validated by
direct empirical evidence. Clients
wish to consider very many occupa-
tions and activities. It is not possible
to perform empirical studies with
respect to all these, and reasonable 
tentative inferences may often be 
made in the absence of evidence from 
empirical studies. Knowledge from 
internal analysis of the inventory, job 
descriptions, and other sources may 
permit interpretations that will assist 
the client. Such extrapolations 
should be made tentatively, how- 
ever. Extrapolation is found in the 
use of the Strong Blank to describe 
such traits as “interest in social up-
lift occupations,” and in the use of
Kuder scores to describe interest in 
vocations for which validity has not 
been tested. 


Application of the Concepts to Projec-
tive Techniques and Related Clini-
cal Methods


Predictive, concurrent, and con- 
struct validity all have pertinence to 
projective techniques although con- 
struct validity greatly overshadows 
the other two kinds. The prediction
of a specific act of behavior is rarely 
made on the basis of projective in- 
struments. Even the prediction of 
less specific behavior, such as “ability 
to profit from psychotherapy” is 
seldom made on the basis of projec- 
tive techniques alone; in fact, there 
are a number of workers in this field 
who take the position that such an 
attempt should never be made. 

Concurrent validity may be de- 
sired in projective techniques and 
clinical use of ability tests since they 
are used in making diagnostic clas- 
sifications. 


C 1. When validity is reported, the 
manual should indicate clearly what 
type of validity is referred to. The 
unqualified term “validity” should be
avoided unless its meaning is clear
from the context. ESSENTIAL 


[Comment: The manual should make
clear what type of inference the valida-
tion study reports. No manual should
report that “this test is valid.” In the
past, evidence that is not appropriately
termed evidence of validity has been pre-
sented in the manual under that heading.
For example, the “validity” report of the
Thurstone Interest Schedule deals solely
with item-test correlations. The dis-
cussion of item-test correlations in the
manual of the Heston Personal Adjust-
ment Inventory illustrates how such
data may be used in reporting test valid-
ity without risk of misleading readers.

It is not desirable for the manual to
state that any one type of evidence is the
only possible sort of validity evidence.
The following statement made regarding
the Ohio Penal Classification Test is
misleading: “The only criterion which
establishes an intelligence test as valid
is that labelled ‘expert judgment’ or ‘ex-
pert agreement.’”]


C 2. The manual should report 
the validity of each type of inference 
for which a test is recommended. If 
validity of some recommended in- 
terpretation has not been tested, 
that fact should be made clear.
ESSENTIAL


[Comment: In a test used for guidance
it is obviously impossible to present pre- 
dictive validities for all possible criteria 
in which a counselor might be interested. 
The manual should make clear to the test 
user the nature and extent of the extra- 
polations suggested by the author of the 
test, or forced upon him by the problem 
confronting him. Enough information is 
available concerning intelligence tests, 
for example, ‘hat the limits of generaliza- 
tion can be fairly accurately gauged. Less 
is known about tests of spatial ability 
and they cannot be readily applied as 
predictors for criteria for which validity 


Validity 19 


studies have not been made. Hazardous 
extrapolation is likewise involved when 
tests are suggested as predictors of jobs 
solely on the basis of job analysis in- 
formation.] 


C 2.1 The manual should indicate
which, if any, of the interpretations
usually attempted for tests such as
the one under discussion have not
been substantiated or are based
merely on clinical impressions.
ESSENTIAL


[Comment: An example of a highly
desirable practice is the warning to
readers in the manual of the Purdue
Pegboard: “Generalizations concerning the
validity of any test should be made with
great caution, and this is particularly
true of dexterity tests. As Seashore has
reported, motor skills are quite specific
and ordinarily not highly correlated with
each other. This situation perhaps
accounts for the fact that a given
dexterity test may have a rather
satisfactory validity for certain
manipulative jobs and yet be unsuitable
for other manipulative jobs which might
seem to be very similar. It is therefore
highly desirable to conduct a study of
the validity of the several Pegboard
tests among employees on specific jobs
for which the use of the test is
contemplated, rather than attempt to
generalize validity from available
studies.”]


C 2.11 If the manual for an in- 
ventory suggests that the user con- 
sult specific items as a basis for per- 
sonality assessment, it should either 
present validation data for this use or 
call attention to their absence. The 
manual should also warn of the wide 
margins of error inherent in such 
interpretative procedures. ESSENTIAL 

C 2.12 Validity of self-report as a
description of the person’s behavior
can be demonstrated only by
comparing responses on single items to
observed behavior. In the absence
of such evidence, the manual should
warn the reader that such references
are subject to extreme error and
should be used only to direct further
inquiry, as in a counseling interview.
ESSENTIAL


[Comment: If two investigators using
similar criteria obtain very different
predictive validities for a test, a
presentation of both sets of facts in the
test manual is in order. If a test of
mechanical comprehension is validated
against a clerical criterion, on the other
hand, there is probably no value in
reviewing these data in the manual. Badly
controlled or badly analyzed studies need
not be reported in the manual.

Validation samples are frequently
small, with large standard errors of
resulting coefficients. The only way in
which large samples can be built is to
pool results from several comparable
studies. The cumulation of validating
studies serves to set the limits on
generalization, by demonstrating whether a
test applies equally well in a variety of
situations. Desirable practice is
illustrated by the summary of validation
studies provided in the 1946 manual for
the Minnesota Clerical Test.]


Content Validity 


C 3. Findings based on logical
analysis should be carefully dis- 
tinguished from conclusions estab- 
lished by correlation of test behavior 
with criterion behavior. ESSENTIAL 


[Comment: Content validity may be 
established by demonstrating that a test 
samples a particular area. The user can- 
not judge, from this alone, how well the 
test permits drawing conclusions about 
any form of behavior other than the test 
behavior. For instance, it is reported 
that an occupational interest inventory 
inquires about a sample of items, chosen 
to represent vocational areas according 
to their frequency of occurrence. This
is important information about the
content validity of the interest scores, but
it does not alone establish whether the 
student’s scores predict how well he will 
be satisfied in a given type of job.] 


C 4. If a test performance is to be 
interpreted as a sample of perform- 
ance in some universe of situations, 
the manual should indicate clearly 
what universe is represented and 
how adequate the sampling is.
ESSENTIAL

C 4.1 The universe of content
should be defined in terms of the
sources from which items were drawn,
or the content criteria used to
include and exclude items. ESSENTIAL
[Comment: For example, the manual 
for the Lee-Thorpe Occupational Interest 
Inventory describes the method used in 
devising items from the definitions in the 
Dictionary of Occupational Titles.] 


C 4.2 The method of sampling
items within the universe should be
described. ESSENTIAL


[Comment: R. H. Seashore prepared 
a vocabulary test, defining his universe 
as all words in a certain unabridged
dictionary, and sampled according to a
definite plan.]


C 4.3 If items are regarded as a 
sample from a universe, a coefficient 
of internal consistency should be re- 
ported for each descriptive score, to 
demonstrate the extent to which the 
score is saturated with common 
factors. ESSENTIAL 

[Comment: The present Lee-Thorpe 
manual does not report the internal con- 
sistency of its scales. See additional 
recommendations D 5—D 6 regarding in- 
ternal consistency studies.] 


C 4.4 If test performance is to be
interpreted as a sample of
performance in some universe of situations,
and if the test is administered with a
time limit, evidence should be
presented concerning the effect of speed
on test scores. ESSENTIAL


[Comment: The most satisfactory
evidence would be the correlation of one
form, given with the usual time limit, 
against another form given with un- 
limited time. This could be compared to 
the form-form coefficient with time limits 
on both forms. Other simpler informa- 
tion about degree of speeding should be 
given when this correlational study is 
impractical.] 


C 4.5 The date at which any study 
of the adequacy of sampling was 
made should be reported, and also 
the date of any sources of items. 
ESSENTIAL 

[Comment: In achievement testing, it
is frequently the practice to select items
by a careful sampling from textbooks to
identify significant topics. Textbooks
and courses of study change, however,
and the test which was once an excellent
sample becomes obsolete. Therefore the
manual should report some such
statement as the median copyright date of
the textbooks studied, or the date at
which the experts agreed that the items
were representative. In another field, the
Mooney Problem Checklist lists
problems which are common to students, on
which each individual is to check those
which concern him. The Mooney
manual properly reports the date when
the list was collected. After this list has
been used for many years, it will be
valuable to conduct a further study to
determine whether student problems
have changed significantly, and, if so, to
change the test and manual accordingly.]


Predictive Validity 


C 5. When predictive validity is
determined by statistical analysis,
the analysis should be reported in a
form from which the reader can
determine confidence limits of
estimates regarding individuals, or the
probability of misclassification of the
individual on the criterion. ESSENTIAL

C 5.1 Statistical procedures which 
are well known and readily inter- 
preted should be used in reporting 
validity whenever they are appropri- 
ate to the data under examination. 
Any uncommon statistical techniques 
should be explained. ESSENTIAL 

C 5.11 Reports of statistical
validation studies should ordinarily be
expressed by: (a) correlation
coefficients of familiar types; (b)
description of the efficiency with which
the test separates groups, indicating
amount of misclassification or
overlapping; or (c) expectancy tables.
ESSENTIAL


[Comment: Reports of differences
between means of groups, or critical ratios,
are by themselves inadequate informa- 
tion regarding predictive validity. If a 
sample is large, high critical ratios may 
be found even when classification is very 
inaccurate. 

In general, since manuals are directed
to readers who have limited statistical
knowledge, every effort should be made
to communicate validity information
clearly. An example of unwise use of a
novel statistical method is found in the
manual for the Ohio Penal Classification
Test. Ten cases were chosen, separated
at five-point intervals along the OPCT
IQ scale. The IQ's were then correlated
with Wechsler IQ's, yielding a rank
correlation of .93. This correlation is
greater than would be obtained for any
sample not artificially spread along the
scale. While unusual statistical
procedures should be used for special
problems, they should not be used where
standard methods are equally or more
efficient for evaluating the data. They
certainly should be presented so that
they will not mislead the typical user of
the manual.

When a test is recommended for the 
purpose of dividing patients among dis- 
crete categories, correlational measures 
of association should be supplemented 
by percentage figures on
misclassification, i.e., “false positives” and “false
negatives.” When validation involves
comparison of men in an occupation with 
men-in-general, the comparison should 
be presented in such a way as to make 
clear the degree to which the occupa- 
tional group overlaps the general group.] 
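
The warning about critical ratios can
be illustrated with a short calculation.
The sketch below is not part of the
recommendations. It assumes two equally
sized groups with normal score
distributions and a common standard
deviation; the function names and the
figures (means of 52 and 50, standard
deviation of 10) are hypothetical. It
computes the critical ratio of the
difference between means and the
proportion of persons misclassified by a
cutting score midway between the means.

```python
import math


def critical_ratio(mean_a, mean_b, sd, n_a, n_b):
    """Difference between two group means divided by its standard error."""
    standard_error = sd * math.sqrt(1.0 / n_a + 1.0 / n_b)
    return (mean_a - mean_b) / standard_error


def misclassification_rate(mean_a, mean_b, sd):
    """Share of cases on the wrong side of a cut midway between the means.

    Assumes normal distributions, a common standard deviation,
    and equally sized groups (illustrative assumptions only).
    """
    d = abs(mean_a - mean_b) / sd
    # Standard normal cumulative probability at -d/2.
    return 0.5 * (1.0 + math.erf((-d / 2.0) / math.sqrt(2.0)))


# Hypothetical groups whose means differ by one fifth of a standard deviation.
for n in (25, 400, 10000):
    cr = critical_ratio(52.0, 50.0, 10.0, n, n)
    miss = misclassification_rate(52.0, 50.0, 10.0)
    print(f"n per group = {n:5d}  critical ratio = {cr:6.2f}  "
          f"misclassified = {miss:.0%}")
```

Under these assumptions the critical
ratio grows with the square root of the
sample size, while the proportion
misclassified stays near 46 per cent
whatever the number of cases.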


C 5.2 An over-all validity coeffi- 
cient should be supplemented with 
evidence as to the validity of the test 
at different points along the range, 
unless the author reports that the
validity is essentially constant
throughout. VERY DESIRABLE


[Comment: This might be reported by
giving the standard error of estimate at
various test score levels, or by indicating
the proportion of hits, misses, and false
positives at various cutting scores. The
Metropolitan Reading Readiness Test
reports the number of failures in primary
reading expected at each level of test
score.]
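
A tabulation of this kind is easily
prepared from validation records. The
following is a minimal sketch, not a
prescribed procedure. The records are
invented, the function name is
hypothetical, and the definitions are
assumptions made for the illustration: a
person is counted as selected when his
score is at or above the cutting score, a
hit is a selected person who succeeded, a
false positive is a selected person who
failed, and a miss is a rejected person
who would have succeeded.

```python
def cutting_score_table(records, cuts):
    """Tally outcomes at each cutting score.

    records: list of (test_score, succeeded) pairs.
    A person is 'selected' when test_score >= cut.
    """
    rows = []
    for cut in cuts:
        hits = sum(1 for s, ok in records if s >= cut and ok)
        false_pos = sum(1 for s, ok in records if s >= cut and not ok)
        misses = sum(1 for s, ok in records if s < cut and ok)
        correct_rej = sum(1 for s, ok in records if s < cut and not ok)
        rows.append({"cut": cut, "hits": hits, "false_positives": false_pos,
                     "misses": misses, "correct_rejections": correct_rej})
    return rows


# Hypothetical validation data: (test score, succeeded on the criterion).
records = [(35, False), (38, False), (41, False), (44, True), (47, False),
           (50, True), (53, False), (56, True), (59, True), (62, True)]

table = cutting_score_table(records, cuts=[40, 50, 60])
for row in table:
    print(row)
```

Raising the cutting score in this
invented sample trades false positives
for misses, which is the information an
over-all coefficient conceals.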


C 5.3 Test manuals should not
report coefficients corrected for
unreliability of the test as estimates of
predictive validity. ESSENTIAL


[Comment: Corrections for attenuation
are very much open to misinterpretation,
and if misinterpreted give an
unjustifiably favorable picture of the 
validity of the test. The hazard is il- 
lustrated in the manual for the Heston 
Personal Adjustment Inventory. Heston 
reports correlations between inventory 
scores and criterion ratings, and also re- 
ports the correlations augmented to cor- 
rect for attenuation. He then applies 
significance tests to the augmented 
correlations rather than to the raw cor- 
relations only. Further, he comments 
that the augmented correlations “are as
high as those often secured between
college aptitude tests and college grades.”
This comparison is improper, since 
Heston is comparing his augmented co- 
efficients with uncorrected coefficients 
for ability tests.] 


C 5.31 If such coefficients are re- 
ported for the special purpose of 
studying construct validity, the un- 
corrected coefficients must be re- 
ported also and the proper interpreta- 
tion of the corrected coefficients must 
be discussed. ESSENTIAL 
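
The size of the augmentation at issue
in C 5.3 and C 5.31 can be seen from the
usual correction for attenuation, in
which the observed coefficient is divided
by the square root of the product of the
reliabilities. The sketch below is an
illustration only; the function name, the
observed coefficient of .40, and the
test reliability of .64 are hypothetical
and are not taken from any manual.

```python
import math


def correct_for_attenuation(r_xy, r_xx=1.0, r_yy=1.0):
    """Spearman's correction: r_xy / sqrt(r_xx * r_yy).

    r_xx is the reliability of the test, r_yy that of the criterion.
    Leaving a reliability at 1.0 applies no correction for that variable.
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


raw = 0.40                                            # hypothetical observed validity
augmented = correct_for_attenuation(raw, r_xx=0.64)   # hypothetical test reliability
print(f"raw r = {raw:.2f}, corrected for test unreliability = {augmented:.2f}")
```

With these figures the raw coefficient
of .40 becomes .50. The user of the test
works with the fallible test as it stands,
so the larger figure does not describe
the predictions he can actually make.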

C 6. All measures of criteria should 
be described accurately and in de- 
tail. The manual should evaluate the 
adequacy of the criterion. It should 
draw attention to significant aspects 
of performance which the criterion 
measure does not reflect and to the 
irrelevant factors which it may re- 
flect. ESSENTIAL 


[Comment: Desirable practices are
illustrated in the manual of the General
Clerical Test, where validity is reported
in three specific studies. The nature of
the criterion and the nature of the work
done by the employees tested are
described. Limitations on the data are
mentioned, and stress is placed on the 
necessity of making comparable studies 
with local criteria in any new situation 
where the test is to be applied. 

For specific types of criteria,
particular cautions in description are needed
to avoid misconceptions or ambiguities. 
Some of these are listed in the recom- 
mendations which follow.] 


C 6.1 When validity of a test is
measured by agreement with
psychiatric diagnoses, the diagnostic
terms should be specific and the
categories clearly described. VERY
DESIRABLE


[Comment: “Paranoid schizophrenia,
chronic” is preferable as a category to
“schizophrenia.” Since the types of
patients included in specific diagnostic
classifications vary to some extent
depending on the point of view of the
psychiatrists, a description of each
diagnostic category used in the validity
study should be presented. An example of
good practice is found in Rapaport’s
Diagnostic Psychological Testing where
each diagnostic group is summarily
described in terms of characteristics
judged by the psychiatrists to be basic.]


C 6.11 If the individual usage
given to a vague or variable clinical
term by the validating psychiatrist is
not known, this fact should be
clearly stated and the reader warned
that other raters or measuring
devices might not agree with the
criterion. VERY DESIRABLE


C 6.12 When validity of a clinical 
test is indicated by agreement with 
psychiatric judgment, the training, 
experience, and professional status 
(e.g., diplomate) of the psychiatrist
should be stated. VERY DESIRABLE

C 6.13 When validity of a clinical 
test is indicated by agreement with 
psychiatric judgment, the amount 
and character of the patient contacts 
upon which the judgment is based 
should be stated. ESSENTIAL 

C 6.2 When validity of an apti- 
tude test is determined for predicting 
performance in an occupation, the 
occupation should be accurately de- 
fined. The test user should be given 
a clear understanding as to what 
duties are performed by workers in 
that occupation. ESSENTIAL 

C 6.21 Where a wide range of
duties is subsumed under a given
occupational label, the test user
should be warned against assuming
that only one pattern of interests or
abilities can be satisfied in the
occupation. VERY DESIRABLE

C 6.3 When validity of an aptitude
or interest test for predicting
performance in a course or curriculum
is reported, the character of the
course or curriculum should be clearly
defined. The test user should be
given a clear understanding as to
what types of performance are
required in the course. ESSENTIAL

C 6.4 When predictive validity of 
an interest test is reported, the 
manual should state whether the 
criterion indicates satisfaction, suc- 
cess, or merely continuance in the 
activity under examination. ESSEN- 
TIAL 


[Comment: When validation data
compare men in an occupation to
men-in-general, the manual should point out
the limitations of presence in an
occupation as a sign of success.]


C 6.5 The time elapsing between
the test and determination of the
criterion should be reported. ESSENTIAL

C 6.51 If a test is recommended
for long-term predictions, but data
from longitudinal studies are not
presented, the manual should
emphasize that predictions of this sort
have uncertain validity. ESSENTIAL

C 7. The reliability of the criterion
should be reported if it can be
determined. If such evidence is not
available, the author should discuss
the probable reliability as judged
from indirect evidence. VERY DESIRABLE

[Comment: When validity is measured 
by agreement of the test with psychiatric 
judgment, for example, statistical evalu- 
ation of the agreement among judges 
should be reported.] 
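
One way of evaluating agreement among
judges is sketched below. It is offered
as an assumption about what such an
evaluation might look like, not as the
method the committee had in mind. It
reports the simple proportion of cases on
which two judges agree, together with
Cohen's kappa, an index of agreement
corrected for chance that was published
after these recommendations. The
diagnoses (S, D, N) and the function
names are hypothetical.

```python
from collections import Counter


def percent_agreement(judge_a, judge_b):
    """Proportion of cases given the same category by both judges."""
    matches = sum(1 for a, b in zip(judge_a, judge_b) if a == b)
    return matches / len(judge_a)


def cohen_kappa(judge_a, judge_b):
    """Agreement corrected for the agreement expected by chance.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both judges use a single identical category throughout.
    """
    n = len(judge_a)
    observed = percent_agreement(judge_a, judge_b)
    count_a = Counter(judge_a)
    count_b = Counter(judge_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical diagnoses assigned to the same ten patients by two psychiatrists.
judge_a = ["S", "S", "S", "S", "D", "D", "D", "N", "N", "N"]
judge_b = ["S", "S", "S", "D", "D", "D", "N", "N", "N", "S"]

print(f"agreement = {percent_agreement(judge_a, judge_b):.2f}")
print(f"kappa     = {cohen_kappa(judge_a, judge_b):.2f}")
```

In this invented example the judges
agree on 7 cases in 10, but the
chance-corrected index is only about .55,
which sets a visible limit on any
validity coefficient computed against
either judge's diagnoses.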


C 7.1 If validity coefficients are 
corrected for unreliability of the 
criterion, both corrected and uncor- 
rected coefficients should be reported 
and properly interpreted. ESSENTIAL 







C 8. The date when validation
data were gathered should be re- 
ported. ESSENTIAL 

C 8.1 If the criterion, the condi- 
tions of work, the type of person 
likely to be tested, or the meaning 
of the test items is suspected of 
changing materially with the passage 
of time, the validity of the test 
should be rechecked periodically and 
the results reported in subsequent 
editions of the manual. VERY DESIR- 
ABLE 

[Comment: Criterion data for the
Psychologist scale of the Strong
Vocational Interest Blank were gathered in
1927. Subsequent research showed that
these psychologists were no longer
representative of the field. The current
manual reports the date (1948) of the
validating studies for the revised key.]


C 9. The criterion score of a per- 
son should be determined independ- 
ently of his test score. The manual 
should describe precautions taken to 
avoid contamination of the criterion 
or should warn the reader of any
possible contamination. ESSENTIAL

C 9.1 When the criterion consists
of a rating, grade, or classification
assigned by an employer, teacher,
psychiatrist, etc., the manual must
state whether the test data were
available to the rater or were capable
of influencing his judgment in any
way, e.g., indirectly through other
reports of the psychologist. ESSENTIAL

C 9.11 If the test data could have
influenced the criterion rating, this
fact should be emphasized and the
user warned that the reported
validities are thus contaminated and are
likely to be spuriously raised.
ESSENTIAL

C 10. Test scores to be used in
validation should be determined
independently of criterion scores.
ESSENTIAL


[Comment: In any test where knowl- 
edge about the subject may influence 
test administration or scoring, for in- 
stance in individual intelligence tests or 
projective techniques, the test admin- 
istrator should possess no knowledge of 
the behavior of the subject outside the 
test situation. The manual should dis- 
cuss the extent to which contamination 
of this type is possible unless it is obvious 
from the character of the test that no 
such contamination could occur.  Rec- 
ommendation C 11 below refers to a 
special kind of contamination frequently 
found in studies of objective tests.] 


C 11. When items are selected or 
a scoring key is established empiri- 
cally on the basis of evidence gathered 
on a particular sample, the manual 
should not report validity coefficients 
computed on this sample, or on a 
group which includes any of this
sample. The reported validity
coefficients should be based on a
cross-validation sample. ESSENTIAL

C 11.1 If the manual recommends 
certain regression weights, any valid- 
ity reported for the composite should 
be based on a cross-validation sample. 
VERY DESIRABLE 

[Comment: A possible exception to
recommendation C 11.1 is that a cross- 
validation sample would not be required 
if an appropriate correction for shrinkage 
could be applied to data from the original 
sample. Corrections available at present 
are not adequate for this purpose.] 
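
The reason for requiring a
cross-validation sample in C 11 can be shown
by a small simulation. The sketch below
is an illustration under stated
assumptions, not an analysis of any real
test. Items and criterion are generated
as independent random numbers, so the
true validity of every item is zero. A
scoring key is then built from the ten
items that happen to correlate most
highly with the criterion in one sample,
and the key is scored in that sample and
in a fresh one. All names, the sample
sizes, and the random seed are
hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def make_sample(rng, n_people, n_items):
    """Items and criterion are independent noise: true validity is zero."""
    items = [[rng.gauss(0, 1) for _ in range(n_people)] for _ in range(n_items)]
    criterion = [rng.gauss(0, 1) for _ in range(n_people)]
    return items, criterion


def build_key(items, criterion, n_keep):
    """Keep the items most correlated with the criterion, keyed by sign."""
    rs = [(pearson_r(item, criterion), idx) for idx, item in enumerate(items)]
    rs.sort(key=lambda pair: abs(pair[0]), reverse=True)
    return [(idx, 1 if r >= 0 else -1) for r, idx in rs[:n_keep]]


def score(items, key):
    """Sum the keyed items for each person."""
    n_people = len(items[0])
    return [sum(sign * items[idx][p] for idx, sign in key)
            for p in range(n_people)]


rng = random.Random(1954)
derive_items, derive_crit = make_sample(rng, n_people=40, n_items=60)
cross_items, cross_crit = make_sample(rng, n_people=40, n_items=60)

key = build_key(derive_items, derive_crit, n_keep=10)
r_derivation = pearson_r(score(derive_items, key), derive_crit)
r_cross = pearson_r(score(cross_items, key), cross_crit)
print(f"validity on keying sample:    {r_derivation:.2f}")
print(f"validity on cross-validation: {r_cross:.2f}")
```

Because the key capitalizes on chance
relations in the keying sample, the first
coefficient is substantial although no
item has any real validity; only the
second coefficient is an honest estimate.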


C 12. If the manual recommends 
that interpretation be based on the 
test profile, evidence should be pro- 
vided that the shape of the profile is a 
valid predictor. VERY DESIRABLE 


[Comment: One suitable method, for
example, is to tabulate test profiles
having the same two highest scores, to show
what proportion of these persons are
successful or unsuccessful, and to
compare the discriminating ability of these
combined scores with that of a single
score.]
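
The tabulation described in this
comment can be sketched as follows. The
profiles, the scale names (V, N, S), and
the function names are hypothetical, and
the order of the two highest scales is
ignored, which is an assumption of this
illustration rather than a requirement
of the recommendation.

```python
from collections import defaultdict


def high_point_code(profile, top=2):
    """Names of the `top` highest scales, in alphabetical order."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return tuple(sorted(ranked[:top]))


def tabulate_by_code(cases, top=2):
    """cases: list of (profile dict, succeeded bool).

    Returns {code: (number successful, number of cases, proportion)}.
    """
    counts = defaultdict(lambda: [0, 0])   # code -> [successes, total]
    for profile, succeeded in cases:
        code = high_point_code(profile, top)
        counts[code][0] += int(succeeded)
        counts[code][1] += 1
    return {code: (s, n, s / n) for code, (s, n) in counts.items()}


# Hypothetical profiles on three scales with a success/failure criterion.
cases = [
    ({"V": 60, "N": 55, "S": 40}, True),
    ({"V": 58, "N": 62, "S": 45}, True),
    ({"V": 65, "N": 50, "S": 48}, False),
    ({"V": 40, "N": 57, "S": 61}, False),
    ({"V": 42, "N": 63, "S": 59}, True),
    ({"V": 45, "N": 52, "S": 66}, False),
]

by_pair = tabulate_by_code(cases, top=2)     # two highest scores
by_single = tabulate_by_code(cases, top=1)   # single highest score
print(by_pair)
print(by_single)
```

Setting `top=1` gives the tabulation
for a single score, so the two tables can
be laid side by side to judge whether the
pair of scores discriminates better than
the single score does.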


C 12.1 If the interpretation em- 
phasizes complex nuances of the 
profile pattern which cannot be fully 
specified and depend upon the clini- 
cal experiences of the user, evidence, 
specifying the training and experi- 
ence of the clinicians, should be 
presented to show how much increase 
in accuracy over more simplified in- 
terpretations is gained. ESSENTIAL 

C 12.2 If the matching method is 
used to establish validity for the 
test report as a whole, the manual 
should point out that this analysis 
does not establish the validity of the 
component variables. ESSENTIAL 

C 13. The validation sample
should be described sufficiently for
the user to know whether the persons
he tests may properly be regarded
as represented by the sample on
which validation was based. ESSENTIAL

C 13.1 The user should be warned 
against assuming validity when the 
test is applied to persons unlike those 
in the validating sample. ESSENTIAL 

C 13.2 Appropriate measures of 
central tendency and variability of 
test scores for the validation sample 
should be reported. ESSENTIAL 

C 13.3 The number of cases in the 
validation sample should be reported. 
The group should be described in 
terms of those variables known to be 
related to the quality tested: these 
will normally include age, sex,
socioeconomic status, and level of
education. Any selective factor which
restricts or enlarges the variability of
the sample should be indicated.
ESSENTIAL

[Comment: In tests validated on
patients, the diagnoses of the patients
would usually be important to report. 
The severity or obviousness of the diag- 
nosed condition should be stated when 
feasible. In tests for industrial use or 
vocational guidance, occupation and ex- 
perience of the validation sample should 
be described.] 


C 13.4 If the validation sample is
made up merely of “available records,”
this fact should be stated. The
test user should be warned that the 
group is not a systematic sample of 
any specifiable population. ESSEN- 
TIAL 

C 13.5 A sample made up of “available
records” should be discussed in
some detail as to probable selective 
factors and their presumed influence 
on test variables. VERY DESIRABLE 

C 13.6 If validation is demon- 
strated by comparing groups which 
differ on the criterion, the manual 
should report whether and how much 
the groups differ on other relevant 
variables. ESSENTIAL 

[Comment: Groups which differ on a
criterion may also differ in other respects,
so that the test may be discriminating on
a quality other than that intended.
Score differences between types of
patients, for instance, may reflect
differences in age, education, or length of
time in hospital, unless these factors are
controlled.]


C 14. The author should base
validation studies on samples
comparable, in terms of selection of cases
and conditions of testing, to the
groups to whom the manual
recommends that the test be applied. VERY
DESIRABLE

C 14.1 If the test score
distribution of the validation sample is
markedly different from the
distribution of the group with whom the
test is ordinarily to be used,
coefficients or other measures of
discrimination should be corrected to
the value estimated for the group to
whom the test is to be given.
ESSENTIAL


[Comment: A biserial correlation
between a scholastic aptitude test and
college success, where the persons
distinguished are dropouts and honor
students, will be much higher than a
coefficient based on all entering students.
The test will normally be applied to the
latter group, and the validity coefficient
should emphasize the power of the test in
that group. A correction to raise the
validity coefficient may likewise be needed
when a test is validated on a group of
selected employees. It is always
preferable, however, to gather criterion data
for an unselected group.]


C 14.2 In reporting coefficients
corrected for range, the manual
should report the original coefficient,
and the distribution characteristics
used in making the correction and
the formula employed in making the
correction. ESSENTIAL

C 14.3 Validation of tests intended
for use in guidance should generally
be based upon subjects tested at the
time when they are making
educational or vocational choices. VERY
DESIRABLE


[Comment: Strong standardized his
Vocational Interest Blank on men who
were currently employed in the
occupation in question. The ability of these
scales to differentiate between
occupational groups did not, in and of itself,
warrant using the inventory in the
counseling of high school or college students.
Strong obtained better evidence by
administering the inventory to students and
ascertaining the nature of their later
employment, thus establishing the
relationship between preoccupational score and
later occupation.]
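
The correction for range named in
C 14.1 and C 14.2 may be illustrated by
the formula commonly attributed to
Pearson and known as Thorndike's Case 2.
The recommendations do not specify a
formula, so its use here is an
assumption. It supposes linear
regression, equal scatter about the
regression line, and selection made
directly on the test. The function name
and all figures are hypothetical.

```python
import math


def correct_for_range(r_sample, sd_sample, sd_target):
    """Estimate the correlation in a group with a different test-score spread.

    r_sample  : coefficient observed in the validation sample
    sd_sample : standard deviation of test scores in that sample
    sd_target : standard deviation in the group the test will be used with
    """
    k = sd_target / sd_sample
    return (r_sample * k) / math.sqrt(1.0 - r_sample ** 2 + (r_sample * k) ** 2)


# Selected employees (restricted spread): the estimate is raised.
raised = correct_for_range(0.30, sd_sample=5.0, sd_target=10.0)

# A sample more variable than the target group: the estimate is lowered.
lowered = correct_for_range(0.70, sd_sample=15.0, sd_target=10.0)

print(f"restricted sample: .30 observed, about {raised:.2f} estimated")
print(f"widened sample:    .70 observed, about {lowered:.2f} estimated")
```

The three arguments of the function
correspond to the items C 14.2 asks the
manual to report: the original
coefficient, the distribution
characteristics used, and, through the
function body, the formula employed.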


C 14.4 If a test is presented as 
being useful in the differential diag- 
nosis of patients, it should include 
evidence of the test’s ability to sepa- 
rate diagnostic groups from one 
another. Emphasis should be placed 
on this rather than on the differ- 
entiation of diagnosed abnormal cases 
from the normal population. ESSEN- 
TIAL 

C 15. If the validity of the test can 
reasonably be expected to be differ- 
ent in subgroups which can be identi- 
fied when the test is given, the 
manual should report the validity for 
each group separately or should re- 
port that no difference was found. 
VERY DESIRABLE 

C 15.1 Occupational predictions 
by means of interest tests should be 
validated within a group all of whom 
have the same stated vocational aim. 
DESIRABLE 


[Comment: An interest inventory is an
attempt to obtain more accurate and
complete information than would be
obtained by a simple question such as “List
your preferred occupation.” Whether the
inventory yields useful information can 
be demonstrated only by showing that, 
among persons who give the same answer 
to this simple question, the test makes 
valid discriminations. It is important 
to move in the direction of reporting 
whether among students stating a prefer- 
ence for engineering (for example), those 
who earn high scores do differ on the cri- 
terion from those who earn lower scores.] 


C 15.2 Validity of predictions from
interest tests should be estimated
separately at different levels of
mental ability. DESIRABLE


C 16. Reports of validation studies 
should describe any conditions likely 
to affect the motivation of subjects 
for taking the test. ESSENTIAL 


[Comment: If an ability test is to be 
used for employee selection, it should be 
validated using subjects who are candi- 
dates for employment and are therefore 
motivated to perform well. Under some 
testing conditions, a subject might try to 
“fake” his self-report of interests or per- 
sonality; the controls used to discourage 
such faking should be reported.] 


Concurrent Validity 


All recommendations listed under
predictive validity also apply to reports
of concurrent validity, with the
exception of C 5.

C 17. Reports of concurrent valid- 
ity should be so described that the 
reader will not regard them as estab- 
lishing predictive validity. ESSENTIAL 


[Comment: The Minnesota Teacher 
Attitude Inventory is validated against 
contemporary teaching performance. 
This is reported under the general
heading of “validity,” and use of the test
for selecting teachers or teacher-training
candidates is recommended. The manual
should point out that there have so far
been no studies measuring entering
students and observing them later on the
job.]


C 17.1 For occupational tests
where there are no longitudinal
studies following subjects from the 
time of testing to the point where 
criterion information is available, 
validation data obtained by testing 
samples of employed persons should 
be presented. VERY DESIRABLE 

[Comment: One such method of
preliminary validation is to compare the
distribution of scores for men in an occu- 
pation with those for men-in-general.] 




C 17.11 If data from employed 
persons are used, evidence as to the 
effects of experience on interest in- 
ventory scores should be presented. 
ESSENTIAL 


Construct Validity 


Recommendations C 3-C 16 and
D 5 apply to some reports of construct
validity.

C 18. The manual should report 
all available information which will 
assist the user in determining what 
psychological attributes account for 
variance in test scores. ESSENTIAL 

C 18.1 The manual should report 
correlations between the test and 
other tests which are better understood.
VERY DESIRABLE


[Comment: It is desirable, for
instance, to know the correlation of an
“art aptitude” test for college freshmen
with measures of general or verbal
ability, and also with measures of skill in
drawing. The interpretation of test
scores would differ, depending on whether
these correlations are high or low. On the
other hand, it is clearly impractical to
ask that the test author correlate his test
with all prominent tests. It is especially
valuable to know correlations of this test
with other measures likely to be used in
making decisions about the person
tested.]


C 18.2 The manual should report
the correlations of the test with other
previously published and generally
accepted measures of the same
attributes. VERY DESIRABLE


[Comment: When a test is advanced as
a measure of “general adjustment,” its
correlation with one or more other such
measures should be reported. Similarly,
if a test is advanced as a measure of
“mechanical interest” or “introversion,”
its correlations with other measures of
these traits should be reported. The user
can infer, from the size of such
correlations, whether generalizations established
on the older test can be expected to hold
for the new one. Practical limitations
will prevent the author from correlating
his test with all competing tests. An
example of good practice is the report, in
the Thurstone Interest Schedule, of
correlations with corresponding Kuder
scores.]


C 18.3 If a test given with a time 
limit is to be interpreted as measur- 
ing a hypothetical psychological at- 
tribute, evidence should be presented 
concerning the effect of speed on test 
scores and on the correlation of 
scores with other variables. VERY 
DESIRABLE 

C 18.4 If a test has been included 
in factorial studies which indicate 
the proportion of the test variance 
attributable to widely known refer- 
ence factors, such information should 
be presented in the manual. DESIR- 
ABLE 

C 19. The manual for a test which 
is used primarily to assess postulated 
attributes of the individual should 
outline the theory on which the test 
is based and organize whatever par- 
tial validity data there are to show in 
what way they support the theory. 
VERY DESIRABLE 


D. Reliability 


Reliability is a generic term re- 
ferring to many types of evidence. 
The several types of reliability co- 
efficient do not answer the same 
questions and should be carefully dis- 
tinguished. We shall refer to a 
measure based on internal analysis 
of data obtained on a single trial of a 
test as a coefficient of internal
consistency. The most prominent of
these are the analysis of variance
method (Kuder-Richardson, Hoyt)
and the split-half method. A
correlation between scores from two forms
given at essentially the same time we
shall refer to as a coefficient of
equivalence. The correlation between
test and retest, with an intervening 
period of time, is a coefficient of 
stability. Such a coefficient is also 
obtained when two forms of the test 
are given with an intervening period 
of time. 


[Comment on projective tests: It is
generally recognized that projective tests 
present even more than the usual difficul- 
ties in assessing reliability. It is not al- 
ways clearly appropriate to demand in- 
ternal consistency or stability and as yet 
equivalent forms for the most part do not 
exist. It seems reasonable, however, to 
require an assessment of stability for such 
instruments even though it is recognized 
in some instances that a low retest sta- 
bility over a substantial period merely re- 
flects true trait fluctuation and hence in- 
dicates good validity. Clinical practice 
rarely presumes that the inferences from
projective tests are to be applied on the
very day the test is given. Realistically, 
we must recognize that pragmatic deci- 
sions are being made from test data which 
are meaningful only in terms of at least 
days, and usually weeks or months, of 
therapy and other procedures following 
the test administration. If a certain test 
result is empirically found to be highly 
unstable from day to day, this evidence 
casts doubt upon the utility of the test 
for most purposes even if that fluctuation
might be explained by a hypothesis of trait
inconstancy.

This reasoning applies strictly only to 
the inferred dimensions, and not neces- 
sarily to the directly scored dimensions. 
If a personality variable is estimated 
from a complex of several test variables, 
and in such a way that rather different 
combinations of the test variables can
lead to the same value of the estimate,
it is the temporal stability of the esti-
mate which is subjected to the preced-
ing requirement. But the burden of
proof lies clearly upon the test manual.
If component scores are unstable, it is 
then necessary to gather evidence regard- 
ing the degree to which estimates of the 
underlying personality dimension are 
stable during the interval for which they 
are intended to be used.]


D 1. The test manual should re- 
port such evidence of reliability as 
would permit the reader to judge 
whether scores are sufficiently de- 
pendable for the recommended uses 
of the test. If any of the necessary 
evidence has not been collected, the 
absence of such information should 
be noted. ESSENTIAL 

D 1.1 Recommendation D 1 ap-
plies to every score, subscore, or com-
bination of scores whose interpreta-
tion is suggested. ESSENTIAL

D 1.2 If differences between scores 
are to be interpreted or if the plotting 
of a profile is suggested, the manual 
should report the reliability of differ- 
ences between scores. ESSENTIAL 

D 1.21 If reliability of differences
between an individual's scores is low,
the manual should caution the user
against interpreting profiles or score
differences except as a source of pre-
liminary information to be verified.
ESSENTIAL


[Comment: The California Test of
Mental Maturity reports reliability co-
efficients for the main scores and for
scores on the major sections. Each sec-
tion is further divided, the Spatial sub-
tests, for example, including a group of
items on Manipulation of Areas. By list-
ing scores for such subsections on the
profile sheet, the authors indirectly en-
courage interpretation of them. While
supplementary material on the test men-
tions the low reliability of the subsec-
tions, the manual does not. It would be
sounder practice to plot only those scores
whose reliability is determined and re-
ported in the manual.

The Watson-Glaser Critical Thinking 
Appraisal suggests that study of pupil 
performance on various types of items 
may enrich the interpretation. The man- 
ual adds this desirable caution: 

“For a relatively small number of 
items such indices and special scores 
would not have high statistical reliabil- 
ity, and hence attention should be paid 
only to extreme deviates. For this rea- 
son norms for these special scores are not 
given and they are suggested only as an 
aid in helping students.” 

This paragraph illustrates how a man- 
ual may conform to the spirit of the 
Technical Recommendations even when 
some form of data is not provided in the 
manual. The statement would be im-
proved if it were worded “such indices
are probably unreliable” in the place of the
present correct but euphemistic phras-
ing.]
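
The caution in D 1.2 and D 1.21 can be made concrete with a short calculation. The following Python sketch applies the classical formula for the reliability of a difference between two scores. It assumes both scores are on a common standard-score scale with equal variances and uncorrelated errors of measurement. The function name and the coefficients are hypothetical and are not taken from any manual.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y between two equally scaled scores.

    r_xx, r_yy: reliabilities of the two scores.
    r_xy: correlation between the two scores.
    Assumes equal variances and uncorrelated errors of measurement.
    """
    if r_xy >= 1.0:
        raise ValueError("perfectly correlated scores have no reliable difference")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical subtests: each has reliability .80 and they correlate .60.
print(round(difference_reliability(0.80, 0.80, 0.60), 2))  # 0.5
```

Under these assumptions two respectable subtests yield a difference score with a reliability of only .50, which is the situation in which a profile should be treated as preliminary information.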


D 1.3 One or more measures of 
reliability should be reported even 
when tests are recommended solely 
for empirical prediction of criteria. 
DESIRABLE 

[Comment: The E. R. C. Stenographic
Aptitude Test reports validity coef-
ficients without also giving an estimate
of reliability. For certain judgments, such
as the potential effect of lengthening the
test, information about reliability is re-
quired and should be available to the
user.]
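
One judgment of the kind mentioned in the comment on D 1.3 is the effect of lengthening a test. A minimal Python sketch follows, assuming the Spearman-Brown formula is the intended tool and that the added items are comparable to the existing ones. The function name and the figures are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when a test is lengthened by length_factor.

    Assumes the added material is comparable to the original test
    (Spearman-Brown formula). length_factor = 2.0 means doubling the test.
    """
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical test with reliability .60, doubled in length.
print(round(spearman_brown(0.60, 2.0), 2))  # 0.75
```

Without a reported reliability the user has nothing to enter in such a formula, which is why the coefficient is wanted even for tests used only for empirical prediction.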


D 1.4 In connection with reliabil- 
ity measures, the manual should re- 
port whether the error of measure- 
ment varies at different score levels. 
If there is significant change in the 
error of measurement from level to 
level, this fact should be properly in- 
terpreted. VERY DESIRABLE 





[Comment: Terman and Merrill point
out that differences in IQ from Form L
to Form M of the Revised Stanford-
Binet Scale are much larger for IQ's
above 100 than for low IQ's.

The California Test of Personality in-
tentionally yields markedly skewed
scores. This lowers the reliability co-
efficients from the value that might be
attained with a normal distribution of
raw scores, but reduces the error of iden-
tifying the most maladjusted cases. Here
the most appropriate information on reli-
ability would be the expected variation
of percentile scores from trial to trial,
reported separately for low and high
scores.]
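
As a sketch of how the information asked for in D 1.4 might be obtained, the Python function below estimates the error of measurement separately for low and high scorers from the differences between two forms. It assumes the forms are parallel, with equal means and independent errors, so that the variance of the difference is twice the error variance. The function name, the cutting point, and the scores are hypothetical.

```python
import math


def sem_by_level(form_a, form_b, cut):
    """Standard error of measurement below and at or above a cutting score.

    Examinees are assigned to a level by the mean of their two form scores.
    For each level, SEM = sqrt(mean(d^2) / 2), where d is the difference
    between forms; this assumes parallel forms with independent errors.
    """
    low, high = [], []
    for a, b in zip(form_a, form_b):
        (low if (a + b) / 2.0 < cut else high).append(a - b)

    def sem(diffs):
        if not diffs:
            return float("nan")  # no cases at this level
        return math.sqrt(sum(d * d for d in diffs) / (2.0 * len(diffs)))

    return sem(low), sem(high)


# Hypothetical scores on two forms for eight examinees.
form_a = [82, 90, 95, 88, 110, 125, 118, 131]
form_b = [84, 88, 93, 90, 118, 117, 126, 123]
sem_low, sem_high = sem_by_level(form_a, form_b, cut=100)
print(round(sem_low, 2), round(sem_high, 2))
```

In these invented data the error is about four times as large in the upper range, a fact that a single coefficient for the whole group would conceal.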


D 1.5 Reports of reliability studies 
should ordinarily be expressed in 
terms of: (a) the product-moment 
correlation coefficient; (b) another 
standard measure of relationship 
suitable to categorical judgments; or 
(c) the standard error of measure- 
ment. ESSENTIAL 


[Comment: Chi square is not an ade- 
quate index of reliability for categorical 
judgments, since it reflects level of sig- 
nificance rather than magnitude of rela- 
tionship.] 
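
The point of this comment can be illustrated numerically. In the Python sketch below, two hypothetical fourfold tables show the same proportions of agreement between two categorical judgments, one based on 100 cases and one on 1,000. The phi coefficient is used as an example of a standard measure of relationship for such data; the tables are invented.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi square, without continuity correction, for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def phi_coefficient(a, b, c, d):
    """Phi coefficient: the product-moment correlation of two dichotomies."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


small = (30, 20, 20, 30)               # 100 hypothetical cases classified twice
large = tuple(10 * x for x in small)   # same proportions, 1,000 cases

print(chi_square_2x2(*small), phi_coefficient(*small))  # 4.0 0.2
print(chi_square_2x2(*large), phi_coefficient(*large))  # 40.0 0.2
```

Chi square rises tenfold with the number of cases while the degree of relationship is unchanged, which is why it does not serve as an index of reliability.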


D 2. The manual should avoid any 
implication that reliability measures 
demonstrate the predictive or con- 
current validity of the test. ESSEN-
TIAL


[Comment: Properly interpreted reli- 
ability coefficients may support analysis 
of content or construct validity.] 


D 3. In reports of reliability, pro-
cedures and sample should be de-
scribed sufficiently for the reader to
judge whether the evidence applies
to the persons and problem with
which he is concerned. ESSENTIAL

D 3.1 Evidence of reliability
should be obtained under conditions
like those in which the author recom-
mends that the test be used. VERY
DESIRABLE


[Comment: The maturity of the group,
the variation in the group, and the atti-
tude of the group toward the test should
represent normal conditions of test use.
For example, the reliability of a test to be
used in selecting employees should be de-
termined by testing applicants for posi-
tions rather than by testing college stu-
dents or workers already employed.]


D 3.2 The reliability sample should
be described in terms of any selective 
factors related to the variable being 
measured, usually including age, sex, 
and educational level. Number of 
cases of each type should be reported. 
ESSENTIAL 

D 3.3 Appropriate measures of 
central tendency and variability of 
the test scores of the reliability 
sample should be reported. ESSEN- 
TIAL 

D 3.31 If reliability coefficients are 
corrected for restriction of range, the 
nature of the correction should be 
made clear. The manual should also 
report the uncorrected coefficient, to- 
gether with the standard deviation of 
the group tested and the standard 
deviation assumed for the corrected 
sample. In discussing such coeffi- 
cients, emphasis should be placed on 
the one which refers to the degree of 
variation within which discrimina- 
tion is normally required. ESSENTIAL 
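
The quantities that D 3.31 asks the manual to report are the ones needed to reproduce a correction. A minimal Python sketch follows, assuming the common correction in which the error variance is taken to be the same in the group tested and in the group to which the coefficient is extended. The function names and the figures are hypothetical; a manual may use a different correction, which is the reason its nature should be made clear.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def reliability_for_other_range(r_tested: float, sd_tested: float, sd_target: float) -> float:
    """Reliability expected in a group with a different standard deviation.

    Assumes the error variance is identical in both groups, so that only
    the spread of true scores changes.
    """
    return 1.0 - (sd_tested ** 2 / sd_target ** 2) * (1.0 - r_tested)


# Hypothetical study: reliability .84 in a group with SD 8,
# extended to a population assumed to have SD 10.
r_corrected = reliability_for_other_range(0.84, sd_tested=8.0, sd_target=10.0)
print(round(r_corrected, 4))  # 0.8976
```

The corrected coefficient is higher, yet the standard error of measurement is the same (3.2 points) in both groups; the correction changes the coefficient, not the accuracy of an individual score.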

D 3.4 When a test is ordinarily re- 
quired to make discriminations within 
a subclass of the total reliability 
sample, the reliability within each 
class should be investigated sepa- 
rately. If the coefficients differ, each 
separate coefficient should be re-
ported. VERY DESIRABLE


[Comment: The Mechanical Reasoning
section of the Differential Aptitude Tests 
has different reliability for boys and girls. 
The manual reports the reliability for 
each sex and grade.] 


D 3.5 The manual should not im- 
ply that if some method had been 
used to determine reliability other 
than the one actually used, an ap- 
preciably higher coefficient would 
have been obtained. ESSENTIAL 


Equivalence of Forms 


D 4. If two forms of a test are 
made available, with both forms in- 
tended for possible use with the 
same subjects, the correlation be- 
tween forms and information as to 
the equivalence of scores on the two 
forms should be reported. If the 
necessary evidence is not provided, 
the manual should warn the reader
against assuming comparability.
ESSENTIAL

D 4.1 Where two trials of a test
are correlated to determine equiva-
lence, the time between testings
should be stated. ESSENTIAL (see also
D 7)

D 4.2 Where the content of the 
test items can be described meaning- 
fully, a comparative analysis of the 
forms is desirable to show how simi- 
lar they are. DESIRABLE 


Internal Consistency 


D 5. If the manual suggests that a 
score is a measure of a generalized, 
homogeneous trait, evidence of in- 
ternal consistency should be re- 
ported. ESSENTIAL 


[Comment: Internal consistency is im- 
portant if items are viewed as a sample 
from a relatively homogeneous universe, 
as in a test of addition with integers, or a
test presumed to measure introversion.
In a test which is regarded as a collection
of diverse items, such as the Mooney
Checklist, internal consistency is a minor
consideration.]


D 5.1 When a test consists of
separately scored parts or sections,
the correlation between the parts or
sections should be reported. ESSEN-
TIAL


[Comment: Whether it is desirable or
undesirable to have high subtest correla- 
tions depends on the nature and purpose 
of the test. Information on homogeneity 
or internal consistency may be relevant 
to the construct validity of the test.] 


D 5.11 If the manual reports the 
correlation between a subtest and a 
total score, it should point out that 
part of this correlation is an artifact. 
ESSENTIAL 


[Comment: Desirable practice is il- 
lustrated in the 1953 manual for the Cali- 
fornia Test of Personality.] 
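
The artifact referred to in D 5.11 can be shown with a small constructed case. In the Python sketch below, three hypothetical part scores are mutually uncorrelated, yet each part correlates substantially with the total simply because it is included in it. Correlating a part with the sum of the remaining parts is shown as one way of removing the overlap; the data and names are invented for illustration.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Three hypothetical parts, constructed to be mutually uncorrelated.
part_a = [1, -1, 1, -1]
part_b = [1, 1, -1, -1]
part_c = [1, -1, -1, 1]

total = [a + b + c for a, b, c in zip(part_a, part_b, part_c)]
rest = [t - a for t, a in zip(total, part_a)]  # total with part A removed

r_part_total = pearson(part_a, total)  # inflated by overlap
r_part_rest = pearson(part_a, rest)    # overlap removed
print(round(r_part_total, 3), round(r_part_rest, 3))
```

Here the part-total correlation is about .58 although the part shares nothing with the other two parts, as the second coefficient of zero shows.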


D 6. Coefficients of internal con- 
sistency should be determined by 
the split-half method or methods of 
the Kuder-Richardson type, if these 
can properly be used on the data 
under examination. Any other meas- 
ure of internal consistency which the 
author wishes to report in addition 
should be carefully explained.
ESSENTIAL


[Comment: There will no doubt be un- 
usual circumstances where special co- 
efficients give added information. There 
are grave dangers of giving unwarranted 
impressions, however, as is illustrated in 
the case of the Brainard Occupational 
Preference Inventory. This test yields a 
set of scores which are interpreted as a 
profile. The manual reports no informa-
tion on the reliability of these scores, but
does report a “total reliability” based on
a formula by Ghiselli. This reliability
seems not to correspond to any score 
actually interpreted, and what it indi- 
cates about the value of this particular 
test is unclear without more discussion 
than the manual provides. 

The original Kuder-Richardson formu- 
las apply to a restricted case. Of those 
formulas, the one known as Number 20 
is most satisfactory. A formula given by 
Hoyt, and others, has the same meaning 
but is more general in application. 

Guttman has also suggested a “repro- 
ducibility” formula which relates to in- 
ternal consistency. This index presents 
such special problems that it seems to 
have little suitability for test manuals.] 
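
For readers who wish to see the two recommended methods side by side, the Python sketch below computes Kuder-Richardson formula 20 and an odd-even split-half coefficient on a small set of right-wrong item scores. The data are hypothetical. The step-up of the half-test correlation by the Spearman-Brown formula is an assumption of this sketch, as is the choice of an odd-even split.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kr20(items):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores (one row per person)."""
    n, k = len(items), len(items[0])
    totals = [sum(person) for person in items]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in items) / n  # proportion passing item j
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


def split_half(items):
    """Odd-even split-half coefficient, stepped up by the Spearman-Brown formula."""
    odd = [sum(person[0::2]) for person in items]
    even = [sum(person[1::2]) for person in items]
    r_half = pearson(odd, even)
    return 2.0 * r_half / (1.0 + r_half)


# Hypothetical responses of five persons to four items.
items = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(items), 2), round(split_half(items), 2))  # 0.8 0.88
```

The two coefficients differ even on the same data, which illustrates why the manual should state which method was used.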


D 6.1 For time-limit tests, split- 
half or analysis of variance coefficients 
should never be reported unless: 
(a) the manual also reports evidence 
that speed of work has negligible 
influence on scores; or (b) the coeffi-
cient is based on the correlation be-
tween parts administered under sepa-
rate time limits. ESSENTIAL

[Comment: Evidence of accuracy of
measurement for highly speeded tests is 
properly obtained by retesting or testing 
with independent equivalent forms. If 
better evidence is not available, it is ap- 
propriate to use lower-bound formulas 
designed for estimating the internal con- 
sistency of speeded tests to determine the 
minimum coefficient.] 


D 6.2 If several questions within a 
test are experimentally linked so that 
the reaction to one question influences 
the reaction to another, the entire 
group should be treated as an 
“item” in applying the split-half or 
analysis of variance methods.
ESSENTIAL


[Comment: In a reading test, several 
questions about the same paragraph are 
ordinarily experimentally dependent. All 
of these questions should be placed in the
same half-test in using the split-half
method. In the Kuder-Richardson meth-
od, the score on the group of questions
should be treated as an “item” score.]
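
The procedure described in D 6.2 can be sketched as follows. In this Python example the questions attached to each hypothetical reading paragraph are first summed into a single score, and the coefficient is then computed on those group scores. Because the group scores are no longer right-wrong, the sketch uses the general variance form of the Kuder-Richardson coefficient (the form usually associated with Hoyt's analysis of variance method); that choice, the grouping, and the data are assumptions made for illustration.

```python
def variance(values):
    """Variance with N in the denominator."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / n


def internal_consistency(parts):
    """Variance-form coefficient for rows of part scores (one row per person).

    With 0/1 items this equals Kuder-Richardson formula 20.
    """
    k = len(parts[0])
    totals = [sum(person) for person in parts]
    sum_part_vars = sum(variance([person[j] for person in parts]) for j in range(k))
    return (k / (k - 1)) * (1.0 - sum_part_vars / variance(totals))


def pool_linked_items(items, groups):
    """Replace each group of linked questions by its summed score."""
    return [[sum(person[j] for j in group) for group in groups] for person in items]


# Hypothetical test: questions 1-2 refer to one paragraph, questions 3-4 to another.
items = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
groups = [(0, 1), (2, 3)]

by_question = internal_consistency(items)
by_paragraph = internal_consistency(pool_linked_items(items, groups))
print(round(by_question, 2), round(by_paragraph, 2))  # 0.8 0.72
```

The two results need not agree. Only the second treats each paragraph as the unit, as the recommendation requires when the questions are experimentally linked.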


D 6.3 If a test can be divided into 
sets of items of different content, in- 
ternal consistency should be deter- 
mined by procedures designed for 
such tests. VERY DESIRABLE 

[Comment: One such procedure is the
division of the test into “parallel” rather 
than random half-tests. Another pro- 
cedure is to apply the Jackson-Ferguson 
“battery reliability” formula.] 


Stability 


D 7. The manual should indicate 
what degree of stability of scores 
may be expected if a test is repeated 
after time has elapsed. If such evi- 
dence is not presented, the absence 
of information regarding stability 
should be noted. ESSENTIAL 

[Comment: Most educational and psy- 
chological tests measure qualities which 
are presumed to be stable for some time, 
unless training or specified experiences 
intervene. Stability is not always desir- 
able. A measure of interests in childhood 
and adolescence which is highly stable 
would not be sensitive to developmental 
changes.]


D 7.1 Stability of scores should
be determined by administering the
test to the same group at different
times. The manual should report
changes in mean score as well as the
correlation between the two sets of
scores. ESSENTIAL

D 7.11 If a test result is reported
in terms of pass-fail or some other
categorical classification, stability
should be reported in terms of propor-
tion of altered classifications on re-
test. VERY DESIRABLE

D 7.12 In determining stability
of scores by repeated testing, other
precautions such as giving alternate
forms of the test should be used to
minimize recall of specific answers,
especially if the time-interval is not
long enough to assure forgetting.
VERY DESIRABLE
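
The three figures called for in D 7.1 and D 7.11 (change in mean score, correlation between testings, and proportion of altered classifications) can be obtained from the same retest data. A minimal Python sketch, with hypothetical scores and a hypothetical passing score of 15:

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def stability_summary(first, second, passing_score):
    """Return (change in mean, coefficient of stability, proportion reclassified)."""
    n = len(first)
    mean_change = sum(second) / n - sum(first) / n
    r = pearson(first, second)
    altered = sum(
        (a >= passing_score) != (b >= passing_score) for a, b in zip(first, second)
    ) / n
    return mean_change, r, altered


# Hypothetical scores of five persons tested twice.
first = [10, 12, 14, 16, 18]
second = [12, 13, 17, 15, 20]
mean_change, r, altered = stability_summary(first, second, passing_score=15)
print(round(mean_change, 1), round(r, 2), altered)
```

In these invented data the correlation is fairly high, yet the mean rises by 1.4 points and one person in five changes classification; the correlation alone would not reveal either fact.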

D 7.13 In reporting a coefficient of
stability, the manual should describe
the experience or education of the
group between testings, if this would
be expected to affect test scores.
ESSENTIAL

D 7.2 For tests of interest and
ability intended for use prior to
adulthood, the coefficient of stability
should correlate scores obtained at
one particular age with scores at
some later significant age. Coeffi-
cients should be reported separately
for different ages at first test and for
different periods of intervening time.
ESSENTIAL


E. Administration and Scoring 


E 1. The directions for admin- 
istration should be presented with 
sufficient clarity that the test user 
can duplicate the administrative con- 
ditions under which the norms and 
data on reliability and validity were 
obtained. ESSENTIAL 

E 1.1 The published directions 
should be complete enough so that 
people tested will understand the 
task in the way the author intended. 
ESSENTIAL 

[Comment: If, for example, in a per-
sonality inventory, it is intended that
subjects give the first response that oc-
curs to them, this should be made clear
in the directions for administration. Di-
rections for interest inventories should
specify whether the person is to mark
what he would ideally like to do, or
whether he is also to consider the prob-
ability that he would have the opportu-
nity and ability to do them. Likewise, the
directions should specify whether the per-
son is to mark those things he would wish
to do or does occasionally, or only those
things he would like to do or does regu-
larly.]


E 1.2 If expansion or elaboration 
of instructions, giving of hints, etc., 
is permitted, the conditions for it 
should be clearly stated either in the 
form of general rules or by giving 
numerous examples, or both. VERY 
DESIRABLE 

E 1.21 If the examiner is allowed 
freedom and judgment in elaborating 
instructions or giving samples, em- 
pirical data should be presented re- 
garding the effect of variation in 
examiner procedures upon scores. If 
empirical data on the effect of varia- 
tion in examiner procedure are not 
available, this fact should be ex- 
plicitly stated and the user warned 
that the effects of such variation are 
unknown. ESSENTIAL 

E 1.3 If the test under considera- 
tion is of a type where previous ex- 
perience demonstrates that subjects 
are likely to present an unrealistic 
picture of themselves, the manual 
should give evidence regarding the
extent to which such distortion may
affect scores. ESSENTIAL


[Comment: Such evidence is ordinarily
to be provided by measuring the shift of
scores when the test is administered in
different situations (e.g., pre-employment
and postemployment) or with instruc-
tions intended to induce different sets.
This problem is especially acute for per-
sonality and interest inventories and
projective techniques.]

E 1.31 If the test is provided with
a verification key or key to correct for
inappropriate test-taking attitudes,
evidence that this key performs its
function should be provided. ESSEN-
TIAL

E 2. Where subjective processes 
enter into the scoring of the test, evi- 
dence on degree of agreement be- 
tween independent scorings should 
be presented. If such evidence is not 
provided, the manual should draw 
attention to scorer error as a possible 
source of error of measurement. 


[Comment: With projective tests, the
role of interscorer agreement in the actu-
al classification of raw response data is
more crucial than in the case of a test
where an “error in scoring” means a
clerical error or something close to that. 
Interscorer agreement is not a demon- 
stration of reliability in the usual sense, 
or a substitute for it. Interscorer agree- 
ment deals solely with the objectivity of 
classifying the behavior sampled from 
subjects, and is, therefore, directed at a 
condition on the part of the judge's be- 
havior that is necessary for “reliability.” 
Interscorer consistency is obviously not a 
sufficient condition, since it cannot pos- 
sibly give information regarding the ade- 
quacy of that behavior as a sample from 
the subject.] 
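
As an illustration of the kind of evidence E 2 calls for, the Python sketch below summarizes agreement between two independent scorers who have classified the same ten responses. The proportion of identical classifications is the simplest index. A chance-corrected index (Cohen's kappa, a later statistic not named in these recommendations) is included as an assumption about what a present-day report might add. The categories and classifications are hypothetical.

```python
from collections import Counter


def proportion_agreement(scorer_a, scorer_b):
    """Proportion of responses placed in the same category by both scorers."""
    return sum(a == b for a, b in zip(scorer_a, scorer_b)) / len(scorer_a)


def cohen_kappa(scorer_a, scorer_b):
    """Agreement corrected for the agreement expected from the marginal
    category frequencies alone."""
    n = len(scorer_a)
    observed = proportion_agreement(scorer_a, scorer_b)
    count_a, count_b = Counter(scorer_a), Counter(scorer_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical classifications of ten responses into categories "A" and "B".
scorer_1 = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
scorer_2 = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]

print(proportion_agreement(scorer_1, scorer_2))    # 0.8
print(round(cohen_kappa(scorer_1, scorer_2), 2))   # 0.58
```

Either figure describes only the objectivity of classification. As the comment above states, it is not a substitute for evidence on the reliability of the scores themselves.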


E 2.1 The bases for scoring and 
the procedure for training the scorers 
should be presented in sufficient de-
tail to permit other scorers to reach
the degree of agreement reported in 
studies of scorer agreement given in 
the manual. VERY DESIRABLE 

[Comment: One desirable practice is to 
present a list of the commoner responses 
or response categories with their scoring 
indicated.] 


E 2.11 If persons having various
degrees of supervised training are
expected to score the test, studies of
the interscorer agreement at each
skill level should be presented. DE-
SIRABLE


E 2.2 If reliability of scoring is 
low, the manual should caution the 
user against interpreting combina- 
tions of such scores. ESSENTIAL 


[Comment: Combinations such as ra- 
tios generally will be even less reliable 
than the component scores.] 


F. Scales and Norms 


F 1. Scales used for reporting 
scores should be such as to increase 
the likelihood of accurate interpreta- 
tion and emphasis by test interpreter 
and subject. ESSENTIAL 


[Comment: Scales in which test scores
are reported are extremely varied. Raw 
scores are used. Relative scores are used. 
Scales purporting to represent equal in- 
tervals with respect to some external di- 
mension (such as age) are used. And so 
on. It is unwise to discourage the devel- 
opment of new scaling methods by insist- 
ing on one form of reporting. On the 
other hand, many different systems are 
now used which have no logical advan- 
tage, one over the other. Recommenda- 
tions below that the number of systems 
now used be reduced to a few with which 
testers can become familiar, are not in- 
tended to discourage the use of unique 
scales for special problems. Suggestions 
as to preferable scales for general report- 
ing are not intended to restrict use of 
other scales in research studies.] 


F 2. Where there is no compelling 
advantage to be obtained by report- 
ing scores in some other form, the 
manual should suggest reporting 
scores in terms of percentile equiva- 
lents or standard scores. VERY DE- 
SIRABLE 


[Comment: Professional opinion is di- 
vided on the question whether mental 
test scores should be reported in terms of 
some theoretical growth scale, such as the 
intelligence quotient or the Heinis index. 
Thus, a test developer who has a ration-
ale for such scales as these should use
them if he regards them as especially
adequate.

On the other hand, there is no theoreti-
cal justification for scoring mental tests
in terms of an “IQ” which is not derived
in terms of the theory underlying the
Binet IQ and which has different statis-
tical properties than the IQ does. Stand-
ard or percentile scores would be prefer-
able to arbitrarily defined IQ scales such
as are used in the Otis Gamma and
Wechsler-Bellevue tests.

Strong recommends that Vocational
Interest Blank scores be converted into
letter grades where “A” indicates that
at least two-thirds of the criterion group
equaled or exceeded a given score, etc.
He bases this recommendation on the
ground that finer score discriminations
would lead only to unwarranted at-
tempts at finer interpretative discrim-
ination.]


F 2.1 If grade norms are provided, 
tables for converting scores to per- 
centiles (or standard scores) within 
each grade should also be provided. 
ESSENTIAL 

[Comment: At the high school level,
norms within courses (e.g., second year 
Spanish) may be more appropriate than 
norms within grades.] 


F 3. Standard scores obtained by 
transforming scores so that they have 
a normal distribution and a fixed 
mean and standard deviation should 
in general be used in preference to 
other derived scores. For some tests, 
there may be a substantial reason to 
choose some other type of derived 
score. VERY DESIRABLE 

F 3.1 If a two-digit standard score 
system is used, the mean of that sys- 
tem should be 50 and the standard 
deviation 10. DESIRABLE 

F 3.2 If a one-digit standard score
system is used, the mean of the sys-
tem should be 5 and the standard
deviation 2 (as in stanines). DESIR-
ABLE


[Comment: The foregoing are pro-
posed as ways of standardizing practice
among test developers. It is expected
that institutions with established sys-
tems, such as the College Board Scale,
with mean at 500, will often retain them
as suited to their purposes.]
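
The scales described in F 3, F 3.1, and F 3.2 can be illustrated by a short computation. The Python sketch below converts a raw score to a normalized standard score by way of its percentile rank in a norm group. The mid-interval definition of percentile rank, the rounding rule used for the one-digit score, and the norm data are assumptions of this sketch; stanines in particular are often defined by fixed percentages of cases, which rounding only approximates.

```python
from statistics import NormalDist


def normalized_scores(norm_group, raw_score):
    """Percentile rank, two-digit and one-digit normalized standard scores.

    raw_score must occur in norm_group, so that the percentile rank lies
    strictly between 0 and 1.
    """
    n = len(norm_group)
    below = sum(s < raw_score for s in norm_group)
    equal = sum(s == raw_score for s in norm_group)
    percentile_rank = (below + 0.5 * equal) / n
    z = NormalDist().inv_cdf(percentile_rank)   # normal deviate for that rank
    two_digit = 50.0 + 10.0 * z                 # mean 50, standard deviation 10
    one_digit = min(9, max(1, round(5.0 + 2.0 * z)))  # mean 5, SD 2, limited to 1-9
    return percentile_rank, two_digit, one_digit


# Hypothetical norm group of five raw scores.
norm_group = [1, 2, 3, 4, 5]
for raw in norm_group:
    print(raw, normalized_scores(norm_group, raw))
```

A raw score at the median of the norm group receives 50 on the two-digit scale and 5 on the one-digit scale, whatever the shape of the raw-score distribution.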


F 3.3 Where percentile scores are 
to be plotted on a profile sheet, the 
profile sheet should be based on the 
normal probability scale. VERY DE- 
SIRABLE 


F 4. Local norms are more im- 
portant for many uses of tests than 
published norms. In such cases the 
manual should suggest appropriate 
emphasis on local norms. VERY DE- 
SIRABLE 


[Comment: The Cooperative Dic-
tionary Test manual precedes its presen-
tation of norms with a discussion urging
schools to prepare local norms and ex-
plaining their advantages over the pub-
lished norms with respect to this test.
Achievement tests, clinical tests, and
tests used for vocational guidance
might well present a similar statement.]


F 5. Except where the primary use
of a test is to compare individuals
with their own local group, norms
should be published at the time of
release of the test for operational
use. ESSENTIAL


[Comment: The Thurstone Interest
Schedule provides a profile of 20 raw
scores. Because each field is based on the
same number of items, norms are said
to be unnecessary. Yet a change of items
in any group would make that cate-
gory more or less preferred. Hence, to
know whether a high score reflects this
individual's interests, or only that these
items are popular with everyone, the
user must consult a set of norms. Judg-
ment in terms of raw scores could be
made only if by some unusual method it
could be demonstrated that the items in
each category are a representative sample
of that field.]


F 5.1 Even though a test is used 
primarily with local norms, the 
manual should give some norms to 
aid the interpreter who lacks local 
norms. DESIRABLE 

F 6. Norms should report the dis- 
tribution of scores in an appropriate 
reference group or groups. ESSEN- 
TIAL 

F 6.1 Unless they can be readily 
inferred from the table of norms, 
measures of central tendency and 
variability of each distribution should 
be given. ESSENTIAL 

F 6.2 If the distribution in the 
norm group is not essentially normal, 
some form of percentile table should 
be provided. ESSENTIAL 

F 6.3 In addition to norms, tables 
showing what expectation a person 
with a given test score has of attain- 
ing or exceeding some relevant cri- 
terion score should be given where 
possible. Conversion tables translat- 
ing test scores into proficiency levels 
should be given when proficiency can 
be described on a meaningful ab- 
solute scale. DESIRABLE 
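
An expectancy table of the kind described in F 6.3 can be assembled directly from paired test and criterion scores. The Python sketch below is one simple way of doing so. The score intervals, the criterion standard of 70, and the data are hypothetical; in practice far larger samples would be needed for dependable proportions.

```python
def expectancy_table(test_scores, criterion_scores, intervals, criterion_standard):
    """For each inclusive test-score interval, return
    (low, high, number of cases, proportion reaching the criterion standard).
    The proportion is None when an interval contains no cases.
    """
    table = []
    for low, high in intervals:
        reached = [
            c >= criterion_standard
            for t, c in zip(test_scores, criterion_scores)
            if low <= t <= high
        ]
        proportion = sum(reached) / len(reached) if reached else None
        table.append((low, high, len(reached), proportion))
    return table


# Hypothetical test scores and later criterion scores for ten persons.
test_scores = [12, 15, 18, 22, 25, 27, 31, 34, 36, 39]
criterion_scores = [55, 72, 60, 58, 75, 80, 66, 85, 90, 78]

table = expectancy_table(
    test_scores, criterion_scores,
    intervals=[(10, 19), (20, 29), (30, 39)],
    criterion_standard=70,
)
for low, high, n, proportion in table:
    print(f"{low}-{high}: {n} cases, proportion reaching standard = {proportion:.2f}")
```

Reporting the number of cases beside each proportion lets the reader judge how far each entry can be trusted.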

F 7. Norms should refer to de- 
fined and clearly described popula- 
tions. These populations should be 
the groups to whom users of the test 
will ordinarily wish to compare the 
persons tested. ESSENTIAL 


[Comment: Intelligence tests designed 
for use with elementary school children 
might well present norms by grade- 
groups as well as by chronological age- 
groups. 

For occupational inventories, norms
based on men who have entered specific
occupations should be developed, except 
where cutting scores or regression formu- 
las are provided for predicting occupa- 
tional criteria. 

The manual should point out that a 
person who has a high degree of interest 
in a curriculum or occupation, when com- 
pared to men-in-general, will generally 
have a much lower degree of interest 
when compared with persons actually en- 
gaged in that field. 

Thus a high percentile score on the
Kuder mechanical scale, in which the 
examinee is compared with men-in-gen- 
eral, may be equivalent to a low per- 
centile when the examinee is compared 
with auto mechanics.] 


F 7.1 The manual should report 
the method of sampling within the 
population, and should discuss any 
probable bias within the sample. 
ESSENTIAL 

F 7.11 Norms should be based on a 
well-planned sample rather than on 
data collected primarily on the basis 
of availability. VERY DESIRABLE 


[Comment: Occupational and educa-
tional test norms have often been based 
on scattered groups of test papers, and 
authors sometimes request that all users 
mail in results for use in subsequent re- 
ports of norms. Distributions so ob- 
tained will contain unknown biases.
Hence, the methods for obtaining the 
samples should be clearly described, as in 
Strong's manual, and whenever possible, 
samples should be stratified to remove 
some of the bias. Planned samples will 
give more dependable norms, however, 
since stratification cannot remove all 
sampling error.] 


F 7.2 The number of cases on 
which the norms are based should be 
reported. ESSENTIAL 


F 7.21 If the sample on which
norms are based is small or otherwise
undependable, the user should be ex-
plicitly cautioned regarding this.
ESSENTIAL


[Comment: In addition to general high 
school and college norms based on sub- 
stantial samples, medians and ranges in 
small, special groups are reported for the 
Watson-Glaser Critical Thinking Ap- 
praisal. Since these samples vary from 10 
cases to 65 cases, the ranges and medians 
are highly unstable. This manual should 
report quartiles in preference to ranges 
because quartiles are more stable. This 
manual should warn the reader as to the 
fallibility of estimates from these special 
samples.] 


F 7.3 The manual should report 
whether scores differ for groups dif- 
fering on age, sex, amount of training, 
and other equally important vari- 
ables. ESSENTIAL 

F 7.31 If appreciable differences 
between groups exist, and if a per- 
son would ordinarily be compared 
with a subgroup rather than with a 
random sample of persons, then sepa- 
rate norm tables should be provided 
in the manual for each group. ESSEN- 
TIAL 


[Comment: An example of unusually 
excellent practice is the norms for the 
Minnesota Teacher Attitude Inventory. 
Here norms are based on teachers sep- 
arated by levels of experience, amount of 
training, and type of position. The
teachers were obtained by a planned 
sample. The manual discusses differ- 
ences between sex groups but does not 
present separate norms, as the decision 
to employ a particular man teacher
rather than a woman would be based on 
the raw score of each, rather than upon 
their standings within their sex group. 

Norms for interest inventories should 
be prepared separately for student exam-
inees having different levels of general
academic ability unless there is evidence 
that scores have no relation to ability.] 

F 7.32 When the total amount of 
scorable behavior is allowed by the 
task to vary, separate norms on the 
various scored variables should be 
presented for different levels of total 
response. VERY DESIRABLE 

F 7.33 If the standardizing sample 
is too small to permit calculation of 
separate norms on scored variables at 
different levels of total response, the 
correlation of each of these with re- 
sponse level must be presented.
ESSENTIAL

F 7.34 If correlation data suggest 
that the dependence of scores on total 
responsiveness is nonlinear, this
should be explicitly stated and the
user warned that linear corrections, 
prorating, or computing of percen- 
tages are inappropriate procedures. 
ESSENTIAL 

F 7.35 If data are insufficient to 
determine the nature of the depend- 
ence of the several scores upon re- 
sponsivity (such as linearity, array 
scatter), this lack of information 
should be explicitly mentioned and 
the possible dangers in interpretation 
should be stressed. ESSENTIAL 


F 7.4 If conditions affecting test 
scores are expected to change as time 
elapses, periodic review of norms is 
required. VERY DESIRABLE 

F 7.5 Some profile sheets record, 
side by side, scores from tests so
standardized that different scores
compare the person to different norm 
groups. Profiles of this type should 
be recommended for use only where 
tests are intended to assess or predict 
the person's standing in different 
situations, where he competes with
the different groups. Where such
mixed scales are compared, the fact 
that the norm groups differ should be 
made clear on the profile sheet. VERY 
DESIRABLE 

F 7.6 The description of the norm 
group should be sufficiently complete 
so that the user can judge whether
his case falls within the population
represented by the norm group. The
description should include number of
cases, classified by relevant variables
such as age, sex, educational status,
etc. ESSENTIAL


F 7.7 The conditions under which 
normative data were obtained should 
be reported. The conditions of test- 
ing, including the purpose of the sub- 
jects in taking the test, should be re- 
ported. ESSENTIAL 


[Comment: Some tests are standard-
ized on job applicant groups, others on
groups which have requested vocational
guidance, and still others on groups which
realized they were “guinea pigs.” Mo-
tivation for taking tests, test-taking atti-
tudes, abilities, and personality charac-
teristics possibly differ on all of these
groups.]


